Machine Learning and Data Mining (机器学习与数据挖掘)

Date: 2022-09-09 11:07:43

Problems: Classification · Clustering · Regression · Anomaly detection · Association rules · Reinforcement learning · Structured prediction · Feature engineering · Feature learning · Online learning · Semi-supervised learning · Unsupervised learning · Learning to rank · Grammar induction

Supervised learning (classification • regression): Decision trees · Ensembles (Bagging, Boosting, Random forest) · k-NN · Linear regression · Naive Bayes · Neural networks · Logistic regression · Perceptron · Relevance vector machine (RVM) · Support vector machine (SVM)

Clustering: BIRCH · CURE · Hierarchical · k-means · Expectation–maximization (EM) · DBSCAN · OPTICS · Mean-shift

Dimensionality reduction: Factor analysis · CCA · ICA · LDA · NMF · PCA · t-SNE

Structured prediction: Graphical models (Bayes net, CRF, HMM)

Anomaly detection: k-NN · Local outlier factor

Neural nets: Autoencoder · Deep learning · Multilayer perceptron · RNN · Restricted Boltzmann machine · SOM · Convolutional neural network

Reinforcement learning: Q-learning · SARSA · Temporal difference (TD)

Theory: Bias–variance dilemma · Computational learning theory · Empirical risk minimization · Occam learning · PAC learning · Statistical learning · VC theory

Machine-learning venues: NIPS · ICML · ML · JMLR · ArXiv:cs.LG

1. Problems

Classification

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into "spam" or "non-spam" classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

In the terminology of machine learning,[1] classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.

Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Other classifiers work by comparing observations to previous observations by means of a similarity or distance function.

An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.

Terminology across fields is quite varied. In statistics, where classification is often done with logistic regression or a similar procedure, the properties of observations are termed explanatory variables (or independent variables, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the dependent variable. In machine learning, the observations are often known as instances, the explanatory variables are termed features (grouped into a feature vector), and the possible categories to be predicted are classes. Other fields may use different terminology: e.g. in community ecology, the term "classification" normally refers to cluster analysis, i.e. a type of unsupervised learning, rather than the supervised learning described in this article.

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape") and typological analysis. The subtle differences are often in the use of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest.

Cluster analysis originated in anthropology with Driver and Kroeber in 1932, was introduced to psychology by Zubin in 1938 and Robert Tryon in 1939,[1][2] and was famously used by Cattell beginning in 1943[3] for trait theory classification in personality psychology.

Regression

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, a function of the independent variables called the regression function is to be estimated. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the prediction of the regression function using a probability distribution. A related but distinct approach is necessary condition analysis[1] (NCA), which estimates the maximum (rather than average) value of the dependent variable for a given value of the independent variable (ceiling line rather than central line) in order to identify what value of the independent variable is necessary but not sufficient for a given value of the dependent variable.

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable;[2] for example, correlation does not prove causation.

The performance of regression analysis methods in practice depends on the form of the data generating process, and how it relates to the regression approach being used. Since the true form of the data-generating process is generally not known, regression analysis often depends to some extent on making assumptions about this process. These assumptions are sometimes testable if a sufficient quantity of data is available. Regression models for prediction are often useful even when the assumptions are moderately violated, although they may not perform optimally. However, in many applications, especially with small effects or questions of causality based on observational data, regression methods can give misleading results.[3][4]

In a narrower sense, regression may refer specifically to the estimation of continuous response (dependent) variables, as opposed to the discrete response variables used in classification.[5] The case of a continuous dependent variable may be more specifically referred to as metric regression to distinguish it from related problems.[6]

Anomaly detection

In data mining, anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.[1] Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.[2]

In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data, unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns.[3]

Three broad categories of anomaly detection techniques exist.[1] Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance being generated by the learnt model.
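
As an illustration of the unsupervised setting, the following is a minimal sketch assuming scikit-learn and NumPy are available; the data, parameter values and variable names are invented, not taken from the text above:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))     # bulk of the data
    outliers = np.array([[6.0, 6.0], [-7.0, 5.0]])             # two obvious anomalies
    X = np.vstack([normal, outliers])

    # fit_predict returns +1 for inliers and -1 for detected outliers
    lof = LocalOutlierFactor(n_neighbors=20)
    labels = lof.fit_predict(X)
    print("flagged as anomalous:", np.where(labels == -1)[0])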

Association rules

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.[1]

Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as, e.g., promotional pricing or product placements.

In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.
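
The strength of a rule such as {onions, potatoes} ⇒ {burger} is usually summarized by its support (how often all the items co-occur) and its confidence (how often the consequent appears among transactions containing the antecedent). A small pure-Python sketch with invented transactions:

    transactions = [
        {"onions", "potatoes", "burger"},
        {"onions", "potatoes", "burger", "beer"},
        {"onions", "potatoes"},
        {"milk", "bread"},
        {"potatoes", "burger"},
    ]

    antecedent = {"onions", "potatoes"}
    consequent = {"burger"}

    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)

    support = both / n          # fraction of all transactions containing the full rule
    confidence = both / ante    # fraction of antecedent transactions that also contain the consequent
    print(f"support={support:.2f}, confidence={confidence:.2f}")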

Reinforcement learning

Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In the operations research and control literature, the field where reinforcement learning methods are studied is called approximate dynamic programming. The problem has been studied in the theory of optimal control, though most studies are concerned with the existence of optimal solutions and their characterization, and not with the learning or approximation aspects. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms for this context utilize dynamic programming techniques.[1] The main difference between the classical techniques and reinforcement learning algorithms is that the latter do not need knowledge about the MDP and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Instead the focus is on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).[2] The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.
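
A tabular Q-learning sketch on a toy chain environment illustrates the exploration/exploitation balance via an ε-greedy policy; the environment, reward and all parameter values are invented for illustration:

    import random

    n_states, n_actions = 5, 2          # toy chain: action 1 moves right, action 0 moves left
    Q = [[0.0] * n_actions for _ in range(n_states)]
    alpha, gamma, eps = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

    def step(s, a):
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        return s_next, reward

    for episode in range(500):
        s = 0
        for _ in range(20):
            # explore with probability eps, otherwise exploit the current estimate
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s_next, r = step(s, a)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next

    print([round(max(q), 2) for q in Q])   # learned state values grow toward the rewarding end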

Structured prediction

Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values.[1]

For example, the problem of translating a natural language sentence into a syntactic representation such as a parse tree can be seen as a structured prediction problem in which the structured output domain is the set of all possible parse trees.

Probabilistic graphical models form a large class of structured prediction models. In particular, Bayesian networks and random fields are popularly used to solve structured prediction problems in a wide variety of application domains including bioinformatics, natural language processing, speech recognition, and computer vision. Other algorithms and models for structured prediction include inductive logic programming, case-based reasoning, structured SVMs, Markov logic networks and constrained conditional models.

Similar to commonly used supervised learning techniques, structured prediction models are typically trained by means of observed data in which the true prediction value is used to adjust model parameters. Due to the complexity of the model and the interrelations of predicted variables the process of prediction using a trained model and of training itself is often computationally infeasible and approximate inference and learning methods are used.

Feature engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.

Feature engineering is an informal topic, but it is considered essential in applied machine learning.

Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.

— Andrew Ng, Machine Learning and AI via Brain simulations[1]

Feature learning

In machine learning, feature learning or representation learning[1] is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensor data has not yielded to attempts to algorithmically define specific features. An alternative is to discover such features or representations through examination, without relying on explicit algorithms.

Feature learning can be either supervised or unsupervised.

In supervised feature learning, features are learned using labeled input data. Examples include supervised neural networks, multilayer perceptron and (supervised) dictionary learning.

In unsupervised feature learning, features are learned with unlabeled input data. Examples include dictionary learning, independent component analysis, autoencoders, matrix factorization[2] and various forms of clustering.[3][4][5]

Online learning

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update our best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g. stock price prediction. Online learning algorithms may be prone to catastrophic interference. This problem is tackled by incremental learning approaches.
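
As an illustration, scikit-learn (an assumed library choice) exposes online learning through partial_fit, which updates the model one mini-batch at a time instead of training on the full dataset at once; the stream of data below is simulated:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    model = SGDClassifier()                 # linear classifier trained by stochastic gradient descent
    classes = np.array([0, 1])

    for _ in range(100):                    # pretend data arrives in small batches over time
        X_batch = rng.normal(size=(32, 3))
        y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
        model.partial_fit(X_batch, y_batch, classes=classes)   # incremental update

    print(model.predict(rng.normal(size=(5, 3))))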

Semi-supervised learning

Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data. Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D structure of a protein or determining whether there is oil at a particular location). The cost associated with the labeling process thus may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.

As in the supervised learning framework, we are given a set of l independently and identically distributed examples x_1, …, x_l ∈ X with corresponding labels y_1, …, y_l ∈ Y. Additionally, we are given u unlabeled examples x_{l+1}, …, x_{l+u} ∈ X. Semi-supervised learning attempts to make use of this combined information to surpass the classification performance that could be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning.

Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data x_{l+1}, …, x_{l+u} only. The goal of inductive learning is to infer the correct mapping from X to Y.

Intuitively, we can think of the learning problem as an exam and labeled data as the few example problems that the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam.

It is unnecessary (and, according to Vapnik's principle, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.
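
A minimal sketch of the setting, assuming scikit-learn is available (its LabelPropagation estimator marks unlabeled points with -1); the data and the number of revealed labels are invented:

    import numpy as np
    from sklearn.semi_supervised import LabelPropagation

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y_true = (X[:, 0] > 0).astype(int)

    y = np.full(200, -1)            # -1 marks unlabeled examples
    y[:10] = y_true[:10]            # only a handful of labels are revealed

    model = LabelPropagation().fit(X, y)
    print("transductive labels for some unlabeled points:", model.transduction_[10:20])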

Unsupervised learning

Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from "unlabeled" data (a classification or categorization is not included in the observations). Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm—which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.

A central case of unsupervised learning is the problem of density estimation in statistics,[1] though unsupervised learning encompasses many other problems (and solutions) involving summarizing and explaining key features of the data.

Learning to rank

Learning to rank[1] or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems.[2] Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item. The ranking model's purpose is to rank, i.e. produce a permutation of items in new, unseen lists in a way which is "similar" to rankings in the training data in some sense.

Grammar induction

Grammar induction (or grammatical inference[1]) is the process in machine learning of learning a formal grammar (usually as a collection of re-write rules or productions or alternatively as a finite state machine or automaton of some kind) from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

2. Supervised learning
(classification • regression)

Decision trees

Decision tree
learning uses a decision tree (as a predictive model) to go from observations
about an item (represented in the branches) to conclusions about the item's
target value (represented in the leaves). It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models
where the target variable can take a discrete set of values are called classification
trees; in these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels. Decision
trees where the target variable can take continuous values (typically real
numbers) are called regression trees.
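
A minimal classification-tree sketch, assuming scikit-learn is available; the toy spam features, labels and feature names are invented for illustration:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # features: [contains_word_offer, num_links]; labels: 1 = spam, 0 = not spam
    X = [[1, 5], [1, 3], [0, 0], [0, 1], [1, 0], [0, 4]]
    y = [1, 1, 0, 0, 0, 1]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["contains_word_offer", "num_links"]))  # branches as rules
    print(tree.predict([[1, 2]]))    # classify a new observation at a leaf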

In decision
analysis, a decision tree can be used to visually and explicitly represent
decisions and decision making. In data mining, a decision tree describes data
(but the resulting classification tree can be an input for decision making).
This page deals with decision trees in data mining.

Ensembles (Bagging, Boosting, Random forest)

In statistics
and machine learning, ensemble methods use multiple learning algorithms to
obtain better predictive performance than could be obtained from any of the
constituent learning algorithms alone.[1][2][3] Unlike a statistical ensemble
in statistical mechanics, which is usually infinite, a machine learning
ensemble consists of only a concrete finite set of alternative models, but
typically allows for much more flexible structure to exist among those
alternatives.
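
A sketch comparing a single decision tree with an ensemble of bagged, randomized trees (a random forest), using scikit-learn as an illustrative implementation and a synthetic dataset:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    single_tree = DecisionTreeClassifier(random_state=0)
    forest = RandomForestClassifier(n_estimators=100, random_state=0)  # bagging + random feature subsets

    print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
    print("random forest:", cross_val_score(forest, X, y, cv=5).mean())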

k-NN

In pattern
recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric
method used for classification and regression.[1] In both cases, the input
consists of the k closest training examples in the feature space. The output
depends on whether k-NN is used for classification or regression:

In k-NN
classification, the output is a class membership. An object is classified by a
majority vote of its neighbors, with the object being assigned to the class
most common among its k nearest neighbors (k is a positive integer, typically
small). If k = 1, then the object is simply assigned to the class of that
single nearest neighbor.

In k-NN
regression, the output is the property value for the object. This value is the
average of the values of its k nearest neighbors.

k-NN is a type
of instance-based learning, or lazy learning, where the function is only
approximated locally and all computation is deferred until classification. The
k-NN algorithm is among the simplest of all machine learning algorithms.

Both for
classification and regression, a useful technique can be to assign weight to
the contributions of the neighbors, so that the nearer neighbors contribute
more to the average than the more distant ones. For example, a common weighting
scheme consists in giving each neighbor a weight of 1/d, where d is the
distance to the neighbor.[2]
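
A NumPy sketch of k-NN classification with the 1/d weighting mentioned above; the data, the value of k and all names are illustrative:

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        d = np.linalg.norm(X_train - x, axis=1)      # distances to every training example
        nearest = np.argsort(d)[:k]                  # indices of the k closest neighbors
        w = 1.0 / (d[nearest] + 1e-12)               # weight each neighbor by 1/d
        votes = {}
        for idx, weight in zip(nearest, w):
            votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + weight
        return max(votes, key=votes.get)             # class with the largest weighted vote

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))   # -> 1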

The neighbors
are taken from a set of objects for which the class (for k-NN classification)
or the object property value (for k-NN regression) is known. This can be
thought of as the training set for the algorithm, though no explicit training
step is required.

A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the data. The algorithm is not to be confused with k-means, another popular machine learning technique.

Linear regression

In statistics,
linear regression is a linear approach for modeling the relationship between a
scalar dependent variable y and one or more explanatory variables (or
independent variables) denoted X. The case of one explanatory variable is
called simple linear regression. For more than one explanatory variable, the
process is called multiple linear regression.[1] (This term is distinct from
multivariate linear regression, where multiple correlated dependent variables
are predicted, rather than a single scalar variable.)[2]

In linear
regression, the relationships are modeled using linear predictor functions
whose unknown model parameters are estimated from the data. Such models are
called linear models.[3] Most commonly, the conditional mean of y given the
value of X is assumed to be an affine function of X; less commonly, the median
or some other quantile of the conditional distribution of y given X is
expressed as a linear function of X. Like all forms of regression analysis,
linear regression focuses on the conditional probability distribution of y
given X, rather than on the joint probability distribution of y and X, which is
the domain of multivariate analysis.

Linear
regression was the first type of regression analysis to be studied rigorously,
and to be used extensively in practical applications.[4] This is because models
which depend linearly on their unknown parameters are easier to fit than models
which are non-linearly related to their parameters and because the statistical
properties of the resulting estimators are easier to determine.

Linear
regression has many practical uses. Most applications fall into one of the
following two broad categories:

If the goal is
prediction, or forecasting, or error reduction, linear regression can be used
to fit a predictive model to an observed data set of y and X values. After
developing such a model, if an additional value of X is then given without its
accompanying value of y, the fitted model can be used to make a prediction of
the value of y.

Given a variable
y and a number of variables X1, ..., Xp that may be related to y, linear
regression analysis can be applied to quantify the strength of the relationship
between y and the Xj, to assess which Xj may have no relationship with y at
all, and to identify which subsets of the Xj contain redundant information
about y.

Linear
regression models are often fitted using the least squares approach, but they
may also be fitted in other ways, such as by minimizing the "lack of
fit" in some other norm (as with least absolute deviations regression), or
by minimizing a penalized version of the least squares loss function as in
ridge regression (L2-norm penalty) and lasso (L1-norm penalty). Conversely, the
least squares approach can be used to fit models that are not linear models.
Thus, although the terms "least squares" and "linear model"
are closely linked, they are not synonymous.
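
A NumPy sketch of ordinary least squares and its ridge-penalized variant in closed form; the data and the penalty strength are invented, and for simplicity the intercept is penalized along with the other coefficients:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

    X1 = np.column_stack([np.ones(len(X)), X])                 # add an intercept column

    beta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)          # least squares fit
    lam = 1.0                                                  # ridge (L2) penalty strength
    beta_ridge = np.linalg.solve(X1.T @ X1 + lam * np.eye(3), X1.T @ y)

    print("OLS coefficients:  ", beta_ols)
    print("ridge coefficients:", beta_ridge)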

Naive Bayes

In machine
learning, naive Bayes classifiers are a family of simple probabilistic
classifiers based on applying Bayes' theorem with strong (naive) independence
assumptions between the features.

Naive Bayes has
been studied extensively since the 1950s. It was introduced under a different
name into the text retrieval community in the early 1960s,[1]:488 and remains a
popular (baseline) method for text categorization, the problem of judging
documents as belonging to one category or the other (such as spam or
legitimate, sports or politics, etc.) with word frequencies as the features.
With appropriate pre-processing, it is competitive in this domain with more
advanced methods including support vector machines.[2] It also finds
application in automatic medical diagnosis.[3]

Naive Bayes
classifiers are highly scalable, requiring a number of parameters linear in the
number of variables (features/predictors) in a learning problem.
Maximum-likelihood training can be done by evaluating a closed-form
expression,[1]:718 which takes linear time, rather than by expensive iterative
approximation as used for many other types of classifiers.
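
A sketch of naive Bayes text classification with word counts as features, using scikit-learn as an illustrative implementation; the documents and labels are invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["win money now", "cheap offer win", "meeting at noon", "project status meeting"]
    labels = [1, 1, 0, 0]                      # 1 = spam, 0 = legitimate

    vec = CountVectorizer()
    X = vec.fit_transform(docs)                # word-frequency feature vectors

    clf = MultinomialNB().fit(X, labels)       # closed-form (smoothed) maximum-likelihood training
    print(clf.predict(vec.transform(["cheap money offer"])))   # -> [1]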

In the statistics
and computer science literature, naive Bayes models are known under a variety
of names, including simple Bayes and independence Bayes.[4] All these names
reference the use of Bayes' theorem in the classifier's decision rule, but
naive Bayes is not (necessarily) a Bayesian method.[1][4]

Neural networks

Artificial
neural networks (ANNs) or connectionist systems are computing systems inspired
by the biological neural networks that constitute animal brains. Such systems
learn (progressively improve performance on) tasks by considering examples,
generally without task-specific programming. For example, in image recognition,
they might learn to identify images that contain cats by analyzing example
images that have been manually labeled as "cat" or "no cat"
and using the results to identify cats in other images. They do this without
any a priori knowledge about cats, e.g., that they have fur, tails, whiskers
and cat-like faces. Instead, they evolve their own set of relevant
characteristics from the learning material that they process.

An ANN is based
on a collection of connected units or nodes called artificial neurons
(analogous to biological neurons in an animal brain). Each connection
(analogous to a synapse) between artificial neurons can transmit a signal from
one to another. The artificial neuron that receives the signal can process it
and then signal artificial neurons connected to it.

In common ANN implementations,
the signal at a connection between artificial neurons is a real number, and the
output of each artificial neuron is calculated by a non-linear function of the
sum of its inputs. Artificial neurons and connections typically have a weight that
adjusts as learning proceeds. The weight increases or decreases the strength of
the signal at a connection. Artificial neurons may have a threshold such that
only if the aggregate signal crosses that threshold is the signal sent.
Typically, artificial neurons are organized in layers. Different layers may
perform different kinds of transformations on their inputs. Signals travel from
the first (input), to the last (output) layer, possibly after traversing the
layers multiple times.
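
A NumPy sketch of a forward pass through a tiny two-layer network, showing the weighted sums and non-linear activations described above; the weights here are random placeholders, not trained values:

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(z):
        return np.maximum(0.0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = rng.normal(size=3)                          # one input example with 3 features
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input layer -> hidden layer (4 units)
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # hidden layer -> output unit

    h = relu(W1 @ x + b1)       # each hidden unit: nonlinear function of a weighted sum of its inputs
    y = sigmoid(W2 @ h + b2)    # output signal squashed into (0, 1)
    print(y)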

The original
goal of the ANN approach was to solve problems in the same way that a human
brain would. Over time, attention focused on matching specific mental
abilities, leading to deviations from biology. ANNs have been used on a variety
of tasks, including computer vision, speech recognition, machine translation,
social network filtering, playing board and video games and medical diagnosis.

Logistic regression

In statistics,
logistic regression, or logit regression, or logit model[1] is a regression
model where the dependent variable (DV) is categorical. This article covers the
case of a binary dependent variable—that is, where the output can take only two
values, "0" and "1", which represent outcomes such as
pass/fail, win/lose, alive/dead or healthy/sick. Cases where the dependent
variable has more than two outcome categories may be analysed in multinomial
logistic regression, or, if the multiple categories are ordered, in ordinal
logistic regression.[2] In the terminology of economics, logistic regression is
an example of a qualitative response/discrete choice model.

Logistic
regression was developed by statistician David Cox in 1958.[2][3] The binary
logistic model is used to estimate the probability of a binary response based
on one or more predictor (or independent) variables (features). It allows one
to say that the presence of a risk factor increases the odds of a given outcome
by a specific factor.
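
A small NumPy sketch of the binary logistic model: the log-odds of the outcome are a linear function of the predictors, so each unit increase in a predictor multiplies the odds by exp(coefficient). The coefficient values below are invented for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    beta0, beta1 = -2.0, 0.8          # intercept and coefficient of a single risk factor x
    for x in [0.0, 1.0, 2.0]:
        p = sigmoid(beta0 + beta1 * x)
        odds = p / (1.0 - p)
        print(f"x={x}: P(y=1)={p:.3f}, odds={odds:.3f}")
    # each unit increase in x multiplies the odds by exp(beta1) ≈ 2.23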

Perceptron

In machine
learning, the perceptron is an algorithm for supervised learning of binary
classifiers (functions that can decide whether an input, represented by a
vector of numbers, belongs to some specific class or not).[1] It is a type of
linear classifier, i.e. a classification algorithm that makes its predictions
based on a linear predictor function combining a set of weights with the
feature vector. The algorithm allows for online learning, in that it processes
elements in the training set one at a time.

The perceptron
algorithm dates back to the late 1950s. Its first implementation, in custom
hardware, was one of the first artificial neural networks to be produced.
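
A NumPy sketch of the perceptron learning rule, processing training examples one at a time (online) and updating the weights only on mistakes; the data are invented and the labels are coded as ±1:

    import numpy as np

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])

    w = np.zeros(2)
    b = 0.0
    for _ in range(10):                          # a few passes over the training set
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:           # misclassified (or on the boundary)
                w += yi * xi                     # perceptron update
                b += yi

    print(w, b)
    print(np.sign(X @ w + b))                    # all training points now correctly classified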

Relevance vector machine (RVM)

In mathematics,
a Relevance Vector Machine (RVM) is a machine learning technique that uses
Bayesian inference to obtain parsimonious solutions for regression and
probabilistic classification.[1] The RVM has an identical functional form to
the support vector machine, but provides probabilistic classification.

It is actually
equivalent to a Gaussian process model with covariance function:

where φ   {\displaystyle \varphi }  \varphi 
is the kernel function (usually Gaussian), α j     {\displaystyle \alpha _{j}}  \alpha _{j} are the variances of the prior on
the weight vector w ∼ N ( 0 , α − 1   I )  
{\displaystyle w\sim N(0,\alpha ^{-1}I)} 
w \sim N(0,\alpha^{-1}I), and x 
1   , … , x  N    
{\displaystyle \mathbf {x} _{1},\ldots ,\mathbf {x} _{N}}  \mathbf{x}_1,\ldots,\mathbf{x}_N are the
input vectors of the training set.[2]

Compared to that
of support vector machines (SVM), the Bayesian formulation of the RVM avoids
the set of free parameters of the SVM (that usually require
cross-validation-based post-optimizations). However RVMs use an expectation
maximization (EM)-like learning method and are therefore at risk of local minima.
This is unlike the standard sequential minimal optimization (SMO)-based
algorithms employed by SVMs, which are guaranteed to find a global optimum (of
the convex problem).

The relevance
vector machine is patented in the United States by Microsoft.[3]

Support vector machine (SVM)

In machine
learning, support vector machines (SVMs, also support vector networks[1]) are
supervised learning models with associated learning algorithms that analyze
data used for classification and regression analysis. Given a set of training
examples, each marked as belonging to one or the other of two categories, an
SVM training algorithm builds a model that assigns new examples to one category
or the other, making it a non-probabilistic binary linear classifier (although
methods such as Platt scaling exist to use SVM in a probabilistic
classification setting). An SVM model is a representation of the examples as
points in space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible. New examples are then
mapped into that same space and predicted to belong to a category based on
which side of the gap they fall.

In addition to
performing linear classification, SVMs can efficiently perform a non-linear
classification using what is called the kernel trick, implicitly mapping their
inputs into high-dimensional feature spaces.
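
A sketch of the kernel trick, assuming scikit-learn is available: an RBF-kernel SVM separates data that no linear classifier in the original two-dimensional space could; the synthetic circular data are invented:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # label = outside the unit circle (not linearly separable)

    linear_svm = SVC(kernel="linear").fit(X, y)
    rbf_svm = SVC(kernel="rbf").fit(X, y)                # implicit mapping to a high-dimensional feature space

    print("linear kernel training accuracy:", linear_svm.score(X, y))
    print("RBF kernel training accuracy:   ", rbf_svm.score(X, y))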

When data are not labeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data into groups, and then map new data to these formed groups. The clustering algorithm which provides an improvement to the support vector machines is called support vector clustering[2] and is often used in industrial applications either when data are not labeled or when only some data are labeled as a preprocessing for a classification pass.

3. Clustering

BIRCH

BIRCH (balanced
iterative reducing and clustering using hierarchies) is an unsupervised data
mining algorithm used to perform hierarchical clustering over particularly
large data-sets.[1] An advantage of BIRCH is its ability to incrementally and
dynamically cluster incoming, multi-dimensional metric data points in an
attempt to produce the best quality clustering for a given set of resources
(memory and time constraints). In most cases, BIRCH only requires a single scan
of the database.

Its inventors
claim BIRCH to be the "first clustering algorithm proposed in the database
area to handle 'noise' (data points that are not part of the underlying
pattern) effectively",[1] beating DBSCAN by two months. The algorithm
received the SIGMOD 10 year test of time award in 2006.[2]

CURE

CURE (Clustering
Using REpresentatives) is an efficient data clustering algorithm for large
databases that is more robust to outliers and identifies clusters having
non-spherical shapes and size variances.

Hierarchical

In data mining
and statistics, hierarchical clustering (also called hierarchical cluster
analysis or HCA) is a method of cluster analysis which seeks to build a
hierarchy of clusters. Strategies for hierarchical clustering generally fall
into two types:[1]

Agglomerative:
This is a "bottom up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This
is a "top down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.

In general, the
merges and splits are determined in a greedy manner. The results of
hierarchical clustering are usually presented in a dendrogram.

The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of O(n³) and requires O(n²) memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity O(n²)) are known: SLINK[2] for single-linkage and CLINK[3] for complete-linkage clustering. With a heap the runtime of the general case can be reduced to O(n² log n) at the cost of further increasing the memory requirements. In many programming languages, the memory overheads of this approach are too large to make it practically usable.

Except for the special case of single-linkage, none of the algorithms (other than exhaustive search in O(2ⁿ)) can be guaranteed to find the optimum solution.

Divisive clustering with an exhaustive search is O(2ⁿ), but it is common to use faster heuristics to choose splits, such as k-means.
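
A sketch of agglomerative (bottom-up) clustering, assuming SciPy is available: linkage performs the merges and the result can be cut into clusters or drawn as a dendrogram; the two-group data are synthetic:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
                   rng.normal(3, 0.3, size=(20, 2))])    # two well-separated groups

    Z = linkage(X, method="single")                      # single-linkage agglomerative clustering
    labels = fcluster(Z, t=2, criterion="maxclust")      # cut the hierarchy into 2 clusters
    print(labels)
    # dendrogram(Z)                                      # with matplotlib available, draws the merge tree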

k-means

k-means
clustering is a method of vector quantization, originally from signal
processing, that is popular for cluster analysis in data mining. k-means
clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, serving as a
prototype of the cluster. This results in a partitioning of the data space into
Voronoi cells.

The problem is
computationally difficult (NP-hard); however, there are efficient heuristic
algorithms that are commonly employed and converge quickly to a local optimum.
These are usually similar to the expectation-maximization algorithm for
mixtures of Gaussian distributions via an iterative refinement approach
employed by both k-means and Gaussian Mixture Modeling. Additionally, they both
use cluster centers to model the data; however, k-means clustering tends to
find clusters of comparable spatial extent, while the expectation-maximization
mechanism allows clusters to have different shapes.

The algorithm
has a loose relationship to the k-nearest neighbor classifier, a popular
machine learning technique for classification that is often confused with
k-means because of the k in the name. One can apply the 1-nearest neighbor
classifier on the cluster centers obtained by k-means to classify new data into
the existing clusters. This is known as nearest centroid classifier or Rocchio
algorithm.
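
A NumPy sketch of the standard iterative refinement (Lloyd's algorithm) behind k-means: assign each point to the nearest centroid, then move each centroid to the mean of its assigned points; the data and parameters are invented:

    import numpy as np

    def kmeans(X, k, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize with random data points
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)                            # assignment step: nearest centroid
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])              # update step: move centroid to the mean
        return centers, labels

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
    centers, labels = kmeans(X, k=2)
    print(centers)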

Expectation–maximization
(EM)

In statistics,
an expectation–maximization (EM) algorithm is an iterative method to find
maximum likelihood or maximum a posteriori (MAP) estimates of parameters in
statistical models, where the model depends on unobserved latent variables. The
EM iteration alternates between performing an expectation (E) step, which
creates a function for the expectation of the log-likelihood evaluated using
the current estimate for the parameters, and a maximization (M) step, which
computes parameters maximizing the expected log-likelihood found on the E step.
These parameter-estimates are then used to determine the distribution of the
latent variables in the next E step.
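
A NumPy sketch of EM for a one-dimensional mixture of two Gaussians, alternating the E and M steps described above; the data, starting values and number of iterations are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    # synthetic data from two Gaussians; the latent variable is which component generated each point
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

    pi = np.array([0.5, 0.5])      # mixing weights
    mu = np.array([-1.0, 1.0])     # component means
    var = np.array([1.0, 1.0])     # component variances

    def normal_pdf(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    for _ in range(100):
        # E step: posterior responsibility of each component for each point
        dens = pi[None, :] * normal_pdf(x[:, None], mu[None, :], var[None, :])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters to maximize the expected log-likelihood
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk

    print("weights:", pi, "means:", mu, "variances:", var)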

DBSCAN

Density-based
spatial clustering of applications with noise (DBSCAN) is a data clustering
algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei
Xu in 1996.[1] It is a density-based clustering algorithm: given a set of
points in some space, it groups together points that are closely packed
together (points with many nearby neighbors), marking as outliers points that
lie alone in low-density regions (whose nearest neighbors are too far away).
DBSCAN is one of the most common clustering algorithms and also most cited in
scientific literature.[2]
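
A sketch assuming scikit-learn is available: DBSCAN takes a neighborhood radius (eps) and a minimum neighbor count, and labels low-density points as -1 (noise); the data are synthetic:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.2, (50, 2)),
                   rng.normal(3, 0.2, (50, 2)),
                   [[10.0, 10.0]]])                 # one isolated point far from both groups

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    print(set(labels))                              # cluster labels for the dense groups, -1 for noise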

In 2014, the
algorithm was awarded the test of time award (an award given to algorithms
which have received substantial attention in theory and practice) at the
leading data mining conference, KDD.[3]

OPTICS

Ordering points
to identify the clustering structure (OPTICS) is an algorithm for finding
density-based[1] clusters in spatial data. It was presented by Mihael Ankerst,
Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander.[2] Its basic idea is
similar to DBSCAN,[3] but it addresses one of DBSCAN's major weaknesses: the
problem of detecting meaningful clusters in data of varying density. In order
to do so, the points of the database are (linearly) ordered such that points
which are spatially closest become neighbors in the ordering. Additionally, a
special distance is stored for each point that represents the density that
needs to be accepted for a cluster in order to have both points belong to the
same cluster. This is represented as a dendrogram.

Mean-shift

Mean shift is a
non-parametric feature-space analysis technique for locating the maxima of a
density function, a so-called mode-seeking algorithm.[1] Application domains
include cluster analysis in computer vision and image processing.[2]

4. Dimensionality reduction

Factor analysis

Factor analysis
is a statistical method used to describe variability among observed, correlated
variables in terms of a potentially lower number of unobserved variables called
factors. For example, it is possible that variations in six observed variables
mainly reflect the variations in two unobserved (underlying) variables. Factor
analysis searches for such joint variations in response to unobserved latent
variables. The observed variables are modelled as linear combinations of the
potential factors, plus "error" terms. Factor analysis aims to find independent
latent variables. The theory behind factor analytic methods is that the
information gained about the interdependencies between observed variables can
be used later to reduce the set of variables in a dataset. Factor analysis is
commonly used in biology, psychometrics, personality theories, marketing,
product management, operations research, and finance. Proponents of factor
analysis believe that it helps to deal with data sets where there are large
numbers of observed variables that are thought to reflect a smaller number of
underlying/latent variables. It is one of the most commonly used
inter-dependency techniques and is used when the relevant set of variables
shows a systematic inter-dependence and the objective is to find out the latent
factors that create a commonality.

Factor analysis
is related to principal component analysis (PCA), but the two are not
identical.[1] There has been significant controversy in the field over
differences between the two techniques (see section on exploratory factor
analysis versus principal components analysis below). PCA is a more basic
version of exploratory factor analysis (EFA) that was developed in the early
days prior to the advent of high-speed computers. From the point of view of
exploratory analysis, the eigenvalues of PCA are inflated component loadings,
i.e., contaminated with error variance.[2][3][4][5][6][7]

CCA

In statistics,
canonical-correlation analysis (CCA) is a way of inferring information from
cross-covariance matrices. If we have two vectors X = (X1, ..., Xn) and Y =
(Y1, ..., Ym) of random variables, and there are correlations among the
variables, then canonical-correlation analysis will find linear combinations of
the Xi and Yj which have maximum correlation with each other.[1] T. R. Knapp
notes that "virtually all of the commonly encountered parametric tests of
significance can be treated as special cases of canonical-correlation analysis,
which is the general procedure for investigating the relationships between two
sets of variables."[2] The method was first introduced by Harold Hotelling
in 1936.[3]

ICA

In signal
processing, independent component analysis (ICA) is a computational method for
separating a multivariate signal into additive subcomponents. This is done by
assuming that the subcomponents are non-Gaussian signals and that they are
statistically independent from each other. ICA is a special case of blind
source separation. A common example application is the "cocktail party
problem" of listening in on one person's speech in a noisy room.
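
A FastICA sketch of the cocktail-party setting, assuming scikit-learn is available: two non-Gaussian source signals are mixed linearly and then recovered up to scaling, sign and ordering; the sources and mixing matrix are invented:

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 2000)
    s1 = np.sign(np.sin(3 * t))                 # a square-ish, non-Gaussian source
    s2 = np.sin(5 * t) ** 3                     # another non-Gaussian source
    S = np.column_stack([s1, s2])

    A = np.array([[1.0, 0.5], [0.4, 1.0]])      # unknown mixing matrix ("two microphones")
    X = S @ A.T                                 # observed mixed signals

    ica = FastICA(n_components=2, random_state=0)
    S_est = ica.fit_transform(X)                # recovered sources, up to sign/scale/order
    print(S_est.shape)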

LDA

Linear
discriminant analysis (LDA) is a generalization of Fisher's linear
discriminant, a method used in statistics, pattern recognition and machine
learning to find a linear combination of features that characterizes or
separates two or more classes of objects or events. The resulting combination
may be used as a linear classifier, or, more commonly, for dimensionality
reduction before later classification.

LDA is closely
related to analysis of variance (ANOVA) and regression analysis, which also
attempt to express one dependent variable as a linear combination of other
features or measurements.[1][2] However, ANOVA uses categorical independent
variables and a continuous dependent variable, whereas discriminant analysis
has continuous independent variables and a categorical dependent variable (i.e.
the class label).[3] Logistic regression and probit regression are more similar
to LDA than ANOVA is, as they also explain a categorical variable by the values
of continuous independent variables. These other methods are preferable in
applications where it is not reasonable to assume that the independent
variables are normally distributed, which is a fundamental assumption of the
LDA method.

LDA is also
closely related to principal component analysis (PCA) and factor analysis in
that they both look for linear combinations of variables which best explain the
data.[4] LDA explicitly attempts to model the difference between the classes of
data. PCA on the other hand does not take into account any difference in class,
and factor analysis builds the feature combinations based on differences rather
than similarities. Discriminant analysis is also different from factor analysis
in that it is not an interdependence technique: a distinction between
independent variables and dependent variables (also called criterion variables)
must be made.

LDA works when
the measurements made on independent variables for each observation are
continuous quantities. When dealing with categorical independent variables, the
equivalent technique is discriminant correspondence analysis.[5][6]

NMF

Non-negative
matrix factorization (NMF or NNMF), also non-negative matrix
approximation[1][2] is a group of algorithms in multivariate analysis and
linear algebra where a matrix V is factorized into (usually) two matrices W and
H, with the property that all three matrices have no negative elements. This
non-negativity makes the resulting matrices easier to inspect. Also, in
applications such as processing of audio spectrograms or muscular activity,
non-negativity is inherent to the data being considered. Since the problem is
not exactly solvable in general, it is commonly approximated numerically.
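
A sketch assuming scikit-learn is available: a non-negative matrix V is approximated by the product of two non-negative factors W and H; the data matrix here is random and purely illustrative:

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    V = rng.random((6, 5))                       # non-negative data matrix, e.g. document-term counts

    model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
    W = model.fit_transform(V)                   # 6 x 2, non-negative
    H = model.components_                        # 2 x 5, non-negative
    print(np.round(W @ H, 2))                    # low-rank approximation of V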

NMF finds
applications in such fields as computer vision, document clustering,[1] chemometrics,
audio signal processing and recommender systems.[3][4]

PCA

Principal
component analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called
principal components. The number of distinct principal components is equal to
the smaller of the number of original variables or the number of observations
minus one. This transformation is defined in such a way that the first
principal component has the largest possible variance (that is, accounts for as
much of the variability in the data as possible), and each succeeding component
in turn has the highest variance possible under the constraint that it is
orthogonal to the preceding components. The resulting vectors are an
uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of
the original variables.

PCA was invented
in 1901 by Karl Pearson,[1] as an analogue of the principal axis theorem in
mechanics; it was later independently developed and named by Harold Hotelling
in the 1930s.[2] Depending on the field of application, it is also named the
discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling
transform in multivariate quality control, proper orthogonal decomposition
(POD) in mechanical engineering, singular value decomposition (SVD) of X (Golub
and Van Loan, 1983), eigenvalue decomposition (EVD) of XTX in linear algebra,
factor analysis (for a discussion of the differences between PCA and factor
analysis see Ch. 7 of [3]), Eckart–Young theorem (Harman, 1960), or
Schmidt–Mirsky theorem in psychometrics, empirical orthogonal functions (EOF)
in meteorological science, empirical eigenfunction decomposition (Sirovich,
1987), empirical component analysis (Lorenz, 1956), quasiharmonic modes (Brooks
et al., 1988), spectral decomposition in noise and vibration, and empirical
modal analysis in structural dynamics.

PCA is mostly
used as a tool in exploratory data analysis and for making predictive models.
It's often used to visualize genetic distance and relatedness between
populations. PCA can be done by eigenvalue decomposition of a data covariance
(or correlation) matrix or singular value decomposition of a data matrix,
usually after mean centering (and normalizing or using Z-scores) the data
matrix for each attribute.[4] The results of a PCA are usually discussed in terms
of component scores, sometimes called factor scores (the transformed variable
values corresponding to a particular data point), and loadings (the weight by
which each standardized original variable should be multiplied to get the
component score).[5]

PCA is the
simplest of the true eigenvector-based multivariate analyses. Often, its
operation can be thought of as revealing the internal structure of the data in
a way that best explains the variance in the data. If a multivariate dataset is
visualised as a set of coordinates in a high-dimensional data space (1 axis per
variable), PCA can supply the user with a lower-dimensional picture, a
projection of this object when viewed from its most informative viewpoint. This
is done by using only the first few principal components so that the
dimensionality of the transformed data is reduced.
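
A NumPy sketch of PCA by singular value decomposition of the mean-centered data matrix, keeping only the first two principal components; the data and the correlation injected into it are invented:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]        # introduce correlation so a low-dimensional picture exists

    Xc = X - X.mean(axis=0)                        # mean-center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    k = 2
    scores = Xc @ Vt[:k].T                         # component scores: projection onto the first k components
    explained = s**2 / np.sum(s**2)
    print("variance explained by each component:", np.round(explained, 3))
    print("reduced data shape:", scores.shape)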

PCA is closely
related to factor analysis. Factor analysis typically incorporates more domain
specific assumptions about the underlying structure and solves eigenvectors of
a slightly different matrix.

PCA is also
related to canonical correlation analysis (CCA). CCA defines coordinate systems
that optimally describe the cross-covariance between two datasets while PCA
defines a new orthogonal coordinate system that optimally describes variance in
a single dataset.[6][7]

t-SNE

t-distributed
stochastic neighbor embedding (t-SNE) is a machine learning algorithm for
dimensionality reduction developed by Geoffrey Hinton and Laurens van der
Maaten.[1] It is a nonlinear dimensionality reduction technique that is
particularly well-suited for embedding high-dimensional data into a space of
two or three dimensions, which can then be visualized in a scatter plot.
Specifically, it models each high-dimensional object by a two- or
three-dimensional point in such a way that similar objects are modeled by
nearby points and dissimilar objects are modeled by distant points.

The t-SNE
algorithm comprises two main stages. First, t-SNE constructs a probability
distribution over pairs of high-dimensional objects in such a way that similar
objects have a high probability of being picked, whilst dissimilar points have
an extremely small probability of being picked. Second, t-SNE defines a similar
probability distribution over the points in the low-dimensional map, and it
minimizes the Kullback–Leibler divergence between the two distributions with
respect to the locations of the points in the map. Note that while the
original algorithm uses the Euclidean distance between objects as the base of
its similarity metric, the metric can be changed as appropriate for the data.
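A hedged usage sketch of the two-stage procedure just described, using the TSNE implementation in scikit-learn on synthetic data (the array sizes and parameters are assumptions chosen for illustration):

import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 50)    # hypothetical high-dimensional data

# Map to 2-D: similar rows end up as nearby points, dissimilar rows far apart.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)              # (500, 2), ready for a scatter plot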

t-SNE has been
used in a wide range of applications, including computer security research,[2]
music analysis,[3] cancer research,[4] bioinformatics,[5] and biomedical signal
processing.[6] It is often used to visualize high-level representations learned
by an artificial neural network.[7]

5. Structured prediction

Graphical models (Bayes net, CRF, HMM)

A graphical
model or probabilistic graphical model (PGM) or structured probabilistic model
is a probabilistic model for which a graph expresses the conditional dependence
structure between random variables. They are commonly used in probability
theory, statistics—particularly Bayesian statistics—and machine learning.
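As a minimal illustration (not from the text above): for a Bayesian network with three nodes and edges X → Y → Z, the graph encodes the factorization p(x, y, z) = p(x) p(y | x) p(z | y), i.e. Z is conditionally independent of X given Y. The graph structure is exactly what licenses this factorization.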

6. Anomaly detection

k-NN

In pattern
recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric
method used for classification and regression.[1] In both cases, the input
consists of the k closest training examples in the feature space. The output
depends on whether k-NN is used for classification or regression:

In k-NN classification,
the output is a class membership. An object is classified by a majority vote of
its neighbors, with the object being assigned to the class most common among
its k nearest neighbors (k is a positive integer, typically small). If k = 1,
then the object is simply assigned to the class of that single nearest
neighbor.

In k-NN
regression, the output is the property value for the object. This value is the
average of the values of its k nearest neighbors.

k-NN is a type
of instance-based learning, or lazy learning, where the function is only
approximated locally and all computation is deferred until classification. The
k-NN algorithm is among the simplest of all machine learning algorithms.

Both for
classification and regression, a useful technique can be to assign weight to
the contributions of the neighbors, so that the nearer neighbors contribute
more to the average than the more distant ones. For example, a common weighting
scheme consists in giving each neighbor a weight of 1/d, where d is the distance
to the neighbor.[2]

The neighbors
are taken from a set of objects for which the class (for k-NN classification)
or the object property value (for k-NN regression) is known. This can be
thought of as the training set for the algorithm, though no explicit training
step is required.

A peculiarity of
the k-NN algorithm is that it is sensitive to the local structure of the data.
The algorithm is not to be confused with k-means, another popular machine
learning technique.
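A minimal sketch of k-NN classification by majority vote, in plain NumPy with a toy training set (the data and the function name are illustrative assumptions, not a reference implementation):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote of its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]               # indices of the k closest examples
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> 1

A distance-weighted variant would replace the plain vote with a weight such as 1/d per neighbor, as mentioned above.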

Local outlier factor

In anomaly
detection, the local outlier factor (LOF) is an algorithm proposed by Markus M.
Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding
anomalous data points by measuring the local deviation of a given data point
with respect to its neighbours.[1]

LOF shares some
concepts with DBSCAN and OPTICS such as the concepts of "core
distance" and "reachability distance", which are used for local
density estimation.[2]
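A hedged usage sketch with scikit-learn's LocalOutlierFactor, on synthetic data in which one point is placed far from a dense cluster (sizes and parameters are assumptions for illustration):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 50 points in a dense blob plus one distant point (toy data).
X = np.vstack([np.random.normal(0, 0.5, size=(50, 2)), [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)            # -1 marks points flagged as outliers
scores = lof.negative_outlier_factor_  # more negative = more anomalous
print(labels[-1], scores[-1])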

7. Neural nets

Autoencoder

An autoencoder,
autoassociator or Diabolo network[1]:19 is an artificial neural network used
for unsupervised learning of efficient codings.[2][3] The aim of an autoencoder
is to learn a representation (encoding) for a set of data, typically for the
purpose of dimensionality reduction. Recently, the autoencoder concept has
become more widely used for learning generative models of data.[4][5]
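A minimal sketch of the idea, assuming tf.keras and synthetic 64-dimensional data (layer sizes, epochs and other settings are illustrative): the network is trained to reproduce its own input, and the narrow middle layer is the learned low-dimensional encoding.

import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 64).astype("float32")   # hypothetical data

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(8, activation="relu"),      # bottleneck code
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(64, activation="sigmoid"),  # reconstruction
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32)   # target equals input
codes = encoder.predict(X)                       # low-dimensional representation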

Deep learning

Deep learning
(also known as deep structured learning or hierarchical learning) is part of a
broader family of machine learning methods based on learning data
representations, as opposed to task-specific algorithms. Learning can be
supervised, semi-supervised or unsupervised.[1][2][3]

Some
representations are loosely based on interpretation of information processing
and communication patterns in a biological nervous system, such as neural
coding that attempts to define a relationship between various stimuli and
associated neuronal responses in the brain.[4]

Deep learning
architectures such as deep neural networks, deep belief networks and recurrent
neural networks have been applied to fields including computer vision, speech
recognition, natural language processing, audio recognition, social network
filtering, machine translation, bioinformatics and drug design[5], where they
have produced results comparable to and in some cases superior[6] to human experts.[7]

Multilayer perceptron

A multilayer
perceptron (MLP) is a class of feedforward artificial neural network. An MLP
consists of at least three layers of nodes. Except for the input nodes, each
node is a neuron that uses a nonlinear activation function. MLP utilizes a
supervised learning technique called backpropagation for training.[1][2] Its
multiple layers and non-linear activation distinguish MLP from a linear
perceptron. It can distinguish data that is not linearly separable.[3]

Multilayer
perceptrons are sometimes colloquially referred to as "vanilla"
neural networks, especially when they have a single hidden layer.[4]
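As a small, hedged illustration of the claim that an MLP can separate data that is not linearly separable, the sketch below fits scikit-learn's MLPClassifier to the XOR problem (the hidden-layer size and iteration count are arbitrary choices):

import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR: no single line separates the two classes, but one hidden layer
# with a nonlinear activation can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    max_iter=5000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))   # ideally [0 1 1 0]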

RNN

A recurrent
neural network (RNN) is a class of artificial neural network where connections
between units form a directed cycle. This allows it to exhibit dynamic temporal
behavior. Unlike feedforward neural networks, RNNs can use their internal
memory to process arbitrary sequences of inputs. This makes them applicable to
tasks such as unsegmented, connected handwriting recognition[1] or speech
recognition.[2][3]
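A minimal NumPy sketch of one recurrent step of a simple (Elman-style) RNN, showing how the hidden state carries information across an input sequence; all sizes and weights are illustrative assumptions.

import numpy as np

input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent (hidden-to-hidden) weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # New hidden state depends on the current input and the previous state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a sequence of 5 inputs
    h = rnn_step(x_t, h)                       # the same weights are reused at every step
print(h)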

Restricted Boltzmann machine

A restricted
Boltzmann machine (RBM) is a generative stochastic artificial neural network
that can learn a probability distribution over its set of inputs.

RBMs were
initially invented under the name Harmonium by Paul Smolensky in 1986,[1] and
rose to prominence after Geoffrey Hinton and collaborators invented fast
learning algorithms for them in the mid-2000s. RBMs have found applications in
dimensionality reduction,[2] classification,[3] collaborative filtering,[4]
feature learning[5] and topic modelling.[6] They can be trained in either
supervised or unsupervised ways, depending on the task.

As their name
implies, RBMs are a variant of Boltzmann machines, with the restriction that
their neurons must form a bipartite graph: a pair of nodes from each of the two
groups of units (commonly referred to as the "visible" and
"hidden" units respectively) may have a symmetric connection between
them; and there are no connections between nodes within a group. By contrast,
"unrestricted" Boltzmann machines may have connections between hidden
units. This restriction allows for more efficient training algorithms than are
available for the general class of Boltzmann machines, in particular the
gradient-based contrastive divergence algorithm.[7]

Restricted
Boltzmann machines can also be used in deep learning networks. In particular,
deep belief networks can be formed by "stacking" RBMs and optionally
fine-tuning the resulting deep network with gradient descent and
backpropagation.[8]
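A hedged usage sketch with scikit-learn's BernoulliRBM on synthetic binary data (the component count and training settings are illustrative); scikit-learn estimates the parameters with a persistent contrastive-divergence style update.

import numpy as np
from sklearn.neural_network import BernoulliRBM

X = (np.random.rand(100, 16) > 0.5).astype(float)   # toy binary visible data

rbm = BernoulliRBM(n_components=4, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)                       # unsupervised training
H = rbm.transform(X)             # hidden-unit activation probabilities
print(H.shape)                   # (100, 4)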

SOM

A
self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of
artificial neural network (ANN) that is trained using unsupervised learning to
produce a low-dimensional (typically two-dimensional), discretized
representation of the input space of the training samples, called a map, and is
therefore a method to do dimensionality reduction. Self-organizing maps differ
from other artificial neural networks as they apply competitive learning as
opposed to error-correction learning (such as backpropagation with gradient
descent), and in the sense that they use a neighborhood function to preserve
the topological properties of the input space.

This makes SOMs
useful for visualizing low-dimensional views of high-dimensional data, akin to
multidimensional scaling. The artificial neural network introduced by the
Finnish professor Teuvo Kohonen in the 1980s is sometimes called a Kohonen map
or network.[1][2] The Kohonen net is a computationally convenient abstraction
building on biological models of neural systems from the 1970s[3] and
morphogenesis models dating back to Alan Turing in the 1950s.[4]

Like most
artificial neural networks, SOMs operate in two modes: training and mapping.
"Training" builds the map using input examples (a competitive
process, also called vector quantization), while "mapping"
automatically classifies a new input vector.

A
self-organizing map consists of components called nodes or neurons. Associated
with each node are a weight vector of the same dimension as the input data
vectors, and a position in the map space. The usual arrangement of nodes is a
two-dimensional regular spacing in a hexagonal or rectangular grid. The
self-organizing map describes a mapping from a higher-dimensional input space
to a lower-dimensional map space. The procedure for placing a vector from data
space onto the map is to find the node with the closest (smallest distance
metric) weight vector to the data space vector.
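A minimal NumPy sketch of the training loop implied by the description above: find the best-matching unit for a sample, then pull that node and its grid neighbors toward the sample. The grid size, learning-rate schedule and neighborhood schedule are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3
W = rng.random((grid_h, grid_w, dim))              # one weight vector per node
coords = np.stack(np.meshgrid(np.arange(grid_h),
                              np.arange(grid_w), indexing="ij"), axis=-1)

X = rng.random((500, dim))                         # training samples
n_steps, lr0, sigma0 = 2000, 0.5, 3.0

for t in range(n_steps):
    x = X[rng.integers(len(X))]
    # Best-matching unit: node whose weight vector is closest to x.
    dists = np.linalg.norm(W - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Neighborhood function shrinks over time, preserving topology.
    lr = lr0 * (1 - t / n_steps)
    sigma = sigma0 * (1 - t / n_steps) + 1e-3
    grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
    W += lr * h * (x - W)                          # move BMU and neighbors toward the sample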

While it is
typical to consider this type of network structure as related to feedforward
networks where the nodes are visualized as being attached, this type of
architecture is fundamentally different in arrangement and motivation.

Useful
extensions include using toroidal grids where opposite edges are connected and
using large numbers of nodes.

It has been
shown that while self-organizing maps with a small number of nodes behave in a
way that is similar to K-means, larger self-organizing maps rearrange data in a
way that is fundamentally topological in character.

It is also
common to use the U-Matrix.[5] The U-Matrix value of a particular node is the average
distance between the node's weight vector and that of its closest neighbors.[6]
In a square grid, for instance, we might consider the closest 4 or 8 nodes (the
Von Neumann and Moore neighborhoods, respectively), or six nodes in a hexagonal
grid.

Large SOMs
display emergent properties. In maps consisting of thousands of nodes, it is
possible to perform cluster operations on the map itself.[7]

Convolutional neural network

In machine
learning, a convolutional neural network (CNN, or ConvNet) is a class of deep,
feed-forward artificial neural networks that has successfully been applied to
analyzing visual imagery.

CNNs use a
variation of multilayer perceptrons designed to require minimal
preprocessing.[1] They are also known as shift invariant or space invariant
artificial neural networks (SIANN), based on their shared-weights architecture
and translation invariance characteristics.[2][3]

Convolutional
networks were inspired by biological processes[4] in which the connectivity
pattern between neurons is inspired by the organization of the animal visual
cortex. Individual cortical neurons respond to stimuli only in a restricted
region of the visual field known as the receptive field. The receptive fields
of different neurons partially overlap such that they cover the entire visual
field.

CNNs use
relatively little pre-processing compared to other image classification
algorithms. This means that the network learns the filters that in traditional
algorithms were hand-engineered. This independence from prior knowledge and
human effort in feature design is a major advantage.

They have
applications in image and video recognition, recommender systems[5] and natural
language processing.[6]
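A minimal tf.keras sketch of a small ConvNet of the kind described above, on synthetic 28×28 grayscale images; the architecture and all hyperparameters are illustrative, not a reference design.

import numpy as np
import tensorflow as tf

X = np.random.rand(64, 28, 28, 1).astype("float32")   # toy images
y = np.random.randint(0, 10, size=64)                  # toy labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu"),  # learned filters replace hand-engineered ones
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=1, batch_size=16)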

8. Reinforcement learning

Q-learning

Q-learning is a
model-free reinforcement learning technique. Specifically, Q-learning can be
used to find an optimal action-selection policy for any given (finite) Markov
decision process (MDP). It works by learning an action-value function, often
denoted Q(s, a), which ultimately gives the expected utility of taking a given
action a in a given state s and following an optimal policy thereafter. A
policy, often denoted π, is a rule that the agent follows in selecting
actions, given the state it is in. When such an action-value function is
learned, the optimal policy can be constructed by simply selecting the action
with the highest value in each state. One of the strengths of Q-learning is
that it is able to compare the expected utility of the available actions
without requiring a model of the environment. Additionally, Q-learning can
handle problems with stochastic transitions and rewards without requiring any
adaptations. It has been proven that for any finite MDP, Q-learning eventually
finds an optimal policy, in the sense that the expected value of the total
reward over all successive steps, starting from the current state, is the
maximum achievable.[1]
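A minimal sketch of the tabular Q-learning update on an assumed toy MDP (the state/action counts and any environment interface are hypothetical); the key point is that the target uses the maximum Q-value of the next state, so no model of the environment is needed.

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # step size, discount, exploration rate

def q_update(s, a, r, s_next):
    # Off-policy target: reward plus the best available action-value in the next state.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def choose_action(s):
    if np.random.rand() < epsilon:          # explore
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))             # exploit current estimates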

SARSA

State–action–reward–state–action
(SARSA) is an algorithm for learning a Markov decision process policy, used in
the reinforcement learning area of machine learning. It was introduced in a
technical note[1] where the alternative name SARSA was only mentioned as a
footnote.

This name simply
reflects the fact that the main function for updating the Q-value depends on
the current state of the agent "S1", the action the agent chooses
"A1", the reward "R" the agent gets for choosing this
action, the state "S2" that the agent will now be in after taking
that action, and finally the next action "A2" the agent will choose
in its new state. Taking every letter in the quintuple (s_t, a_t, r_t, s_{t+1}, a_{t+1})
yields the word SARSA.[2]
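For contrast with the Q-learning sketch above, a minimal SARSA update: being on-policy, it bootstraps from the action the agent actually takes next rather than from the maximizing action (the function and variable names are illustrative).

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # Target uses Q(s', a') for the action a' actually chosen in the next state.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])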

Temporal difference (TD)

Temporal
difference (TD) learning is a prediction-based machine learning method. It has
primarily been used for the reinforcement learning problem, and is said to be
"a combination of Monte Carlo ideas and dynamic programming (DP)
ideas."[1] TD resembles a Monte Carlo method because it learns by sampling
the environment according to some policy, and is related
to dynamic programming techniques as it approximates its current estimate based
on previously learned estimates (a process known as bootstrapping). The TD
learning algorithm is related to the temporal difference model of animal
learning.[2]

As a prediction
method, TD learning considers that subsequent predictions are often correlated
in some sense. In standard supervised predictive learning, one learns only from
actually observed values: A prediction is made, and when the observation is
available, the prediction mechanism is adjusted to better match the
observation. As elucidated by Richard Sutton, the core idea of TD learning is
that one adjusts predictions to match other, more accurate, predictions about
the future.[3] This procedure is a form of bootstrapping, as illustrated with
the following example:

"Suppose
you wish to predict the weather for Saturday, and you have some model that
predicts Saturday's weather, given the weather of each day in the week. In the
standard case, you would wait until Saturday and then adjust all your models.
However, when it is, for example, Friday, you should have a pretty good idea of
what the weather would be on Saturday - and thus be able to change, say,
Monday's model before Saturday arrives."[3]

Mathematically
speaking, both in a standard and a TD approach, one tries to optimize a cost
function related to the error in predicting the expectation of some random
variable, E[z]. However, while the standard approach in some sense assumes
E[z] = z (the actual observed value), the TD approach uses a model. For the
particular case of reinforcement learning, which is the major application of
TD methods, z is the total return and E[z] is given by the Bellman equation of
the return.
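A minimal sketch of the TD(0) prediction update described above, on a hypothetical three-step episode: the value of each state is nudged toward the reward plus the discounted estimate of the next state, rather than toward the full observed return.

import numpy as np

n_states = 5
V = np.zeros(n_states)
alpha, gamma = 0.1, 0.9

# (state, reward, next_state) transitions from one assumed episode.
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]
for s, r, s_next in episode:
    V[s] += alpha * (r + gamma * V[s_next] - V[s])   # bootstrapped TD(0) update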

9. Theory

Bias-variance dilemma

In statistics
and machine learning, the bias–variance tradeoff (or dilemma) is the problem of
simultaneously minimizing two sources of error that prevent supervised learning
algorithms from generalizing beyond their training set:

The bias is an
error from erroneous assumptions in the learning algorithm. High bias can cause
an algorithm to miss the relevant relations between features and target outputs
(underfitting).

The variance is
an error from sensitivity to small fluctuations in the training set. High
variance can cause an algorithm to model the random noise in the training data,
rather than the intended outputs (overfitting).

The
bias–variance decomposition is a way of analyzing a learning algorithm's
expected generalization error with respect to a particular problem as a sum of
three terms, the bias, variance, and a quantity called the irreducible error,
resulting from noise in the problem itself.
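In the usual notation for squared loss, the decomposition reads E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ², where σ² is the irreducible error arising from noise in the problem itself.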

This tradeoff
applies to all forms of supervised learning: classification, regression
(function fitting),[1][2] and structured output learning. It has also been invoked
to explain the effectiveness of heuristics in human learning.[3]

Computational learning theory

In computer
science, computational learning theory (or just learning theory) is a subfield
of Artificial Intelligence devoted to studying the design and analysis of
machine learning algorithms.[1]

Empirical risk minimization

Empirical risk
minimization (ERM) is a principle in statistical learning theory which defines
a family of learning algorithms and is used to give theoretical bounds on the
performance of learning algorithms.
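Concretely, in standard notation: given a loss function L and a sample (x₁, y₁), …, (xₙ, yₙ), ERM selects the hypothesis h that minimizes the empirical risk R_emp(h) = (1/n) Σᵢ L(h(xᵢ), yᵢ), used as a proxy for the true risk E[L(h(x), y)], which cannot be computed directly because the data distribution is unknown.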

Occam learning

In computational
learning theory, Occam learning is a model of algorithmic learning where the
objective of the learner is to output a succinct representation of received
training data. This is closely related to probably approximately correct (PAC)
learning, where the learner is evaluated on its predictive power of a test set.

Occam
learnability implies PAC learning, and for a wide variety of concept classes,
the converse is also true: PAC learnability implies Occam learnability.

PAC learning

In computational
learning theory, probably approximately correct learning (PAC learning) is a
framework for mathematical analysis of machine learning. It was proposed in
1984 by Leslie Valiant.[1]

In this
framework, the learner receives samples and must select a generalization
function (called the hypothesis) from a certain class of possible functions.
The goal is that, with high probability (the "probably" part), the
selected function will have low generalization error (the "approximately
correct" part). The learner must be able to learn the concept given any
arbitrary approximation ratio, probability of success, or distribution of the
samples.

The model was
later extended to treat noise (misclassified samples).

An important
innovation of the PAC framework is the introduction of computational complexity
theory concepts to machine learning. In particular, the learner is expected to
find efficient functions (time and space requirements bounded to a polynomial
of the example size), and the learner itself must implement an efficient
procedure (requiring an example count bounded to a polynomial of the concept
size, modified by the approximation and likelihood bounds).

Statistical learning

Statistical
learning theory is a framework for machine learning drawing from the fields of
statistics and functional analysis.[1][2] Statistical learning theory deals
with the problem of finding a predictive function based on data. Statistical
learning theory has led to successful applications in fields such as computer
vision, speech recognition, bioinformatics and baseball.[3]

VC theory

Vapnik–Chervonenkis
theory (also known as VC theory) was developed during 1960–1990 by Vladimir
Vapnik and Alexey Chervonenkis. The theory is a form of computational learning
theory, which attempts to explain the learning process from a statistical point
of view.

VC theory is
related to statistical learning theory and to empirical processes. Richard M.
Dudley and Vladimir Vapnik, among others, have applied VC-theory to empirical processes.

10. Machine-learning venues

NIPS

The Conference
and Workshop on Neural Information Processing Systems (NIPS) is a machine
learning and computational neuroscience conference held every December. The
conference is currently a double-track meeting (single-track until 2015) that
includes invited talks as well as oral and poster presentations of refereed
papers, followed by parallel-track workshops that up to 2013 were held at ski
resorts.

ICML

The
International Conference on Machine Learning (ICML) is the leading
international academic conference in machine learning. Along with NIPS, it is
one of the two primary conferences of high impact in Machine Learning and
Artificial Intelligence research. It is supported by the International Machine
Learning Society (IMLS).

ML

Machine Learning
is a peer-reviewed scientific journal, published since 1986.

In 2001, forty
editors and members of the editorial board of Machine Learning resigned in
order to support the Journal of Machine Learning Research (JMLR), saying that
in the era of the internet, it was detrimental for researchers to continue
publishing their papers in expensive journals with pay-access archives.
Instead, they wrote, they supported the model of JMLR, in which authors
retained copyright over their papers and archives were freely available on the
internet.[1]

Following the
mass resignation, Kluwer changed their publishing policy to allow authors to
self-archive their papers online after peer review.[2]

JMLR

The Journal of
Machine Learning Research is a peer-reviewed open access scientific journal
covering machine learning. It was established in 2000 and the first
editor-in-chief was Leslie Kaelbling.[1] The editors-in-chief are Kevin Murphy
(Google) and Bernhard Schölkopf (Max Planck Institute for Intelligent Systems).

ArXiv:cs.LG

cs.LG ("Learning") is the arXiv category for machine-learning preprints; its listing pages show authors and titles for recent submissions, organized by date.
