1 - Factorization Machines (Steffen Rendle, 2010)
ABSTRACT
In this paper, we introduce Factorization Machines (FM) which are a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models.
Note: as stated above, the FM is a new model class that combines the advantages of SVMs with those of factorization models.
The basic form of a factorization model is:
$$f(x) = q_1(x)\, q_2(x)\, q_3(x) \cdots q_t(x).$$
Like SVMs, FMs are a general predictor working with any real valued feature vector. In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail.
When features are very sparse, as in recommender systems, SVMs are no longer suitable. To address this, the FM model introduces factorized parameters, which are used to learn the interaction (cross) features.
We show that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly. So unlike nonlinear SVMs, a transformation in the dual form is not necessary and the model parameters can be estimated directly without the need of any support vector in the solution. We show the relationship to SVMs and the advantages of FMs for parameter estimation in sparse settings.
FMs avoid this drawback of SVM training; the FM model equation can be computed in time linear in the number of features, $O(kn)$.
On the other hand there are many different factorization models like matrix factorization, parallel factor analysis, or specialized models like SVD++, PITF or FPMC. The drawback of these models is that they are not applicable for general prediction tasks but work only with special input data. Furthermore their model equations and optimization algorithms are derived individually for each task.
Two drawbacks of factorization models:
- They restrict the input data; in a recommender system, for instance, the input must take the form (uid, sid, score).
- The model and its optimization method have to be tailored to each specific task.
We show that FMs can mimic these models just by specifying the input data (i.e. the feature vectors). This makes FMs easily applicable even for users without expert knowledge in factorization models.
I. INTRODUCTION
Support Vector Machines are one of the most popular predictors in machine learning and data mining. Nevertheless in settings like collaborative filtering, SVMs play no important role and the best models are either direct applications of standard matrix/tensor factorization models like PARAFAC [1] or specialized models using factorized parameters [2], [3], [4].
In this paper, we show that the only reason why standard SVM predictors are not successful in these tasks is that they cannot learn reliable parameters (‘hyperplanes’) in complex (non-linear) kernel spaces under very sparse data.
Note: pay particular attention to how the paper demonstrates why non-linear SVMs do not work on sparse datasets.
On the other hand, the drawback of tensor factorization models, and even more of specialized factorization models, is that (1) they are not applicable to standard prediction data (e.g. a real valued feature vector in $\mathbb{R}^n$) and (2) specialized models are usually derived individually for a specific task, requiring effort in modelling and in the design of a learning algorithm.
In this paper, we introduce a new predictor, the Factorization Machine (FM), that is a general predictor like SVMs but is also able to estimate reliable parameters under very high sparsity.
The factorization machine models all nested variable interactions (comparable to a polynomial kernel in SVM), but uses a factorized parametrization instead of a dense parametrization like in SVMs.
Note: the polynomial kernel has the form $K_n(\mathbf{x}, \mathbf{x}') = (1 + \gamma\, \mathbf{x}^\top \mathbf{x}')^n$ with $\gamma > 0$. A polynomial-kernel SVM uses a dense parametrization of its model parameters, whereas the FM uses a factorized parametrization.
We show that the model equation of FMs can be computed in linear time and that it depends only on a linear number of parameters. This allows direct optimization and storage of model parameters without the need of storing any training data (e.g. support vectors) for prediction. In contrast to this, non-linear SVMs are usually optimized in the dual form and computing a prediction (the model equation) depends on parts of the training data (the support vectors).
We also show that FMs subsume many of the most successful approaches for the task of collaborative filtering including biased MF, SVD++ [2], PITF [3] and FPMC [4].
In total, the advantages of our proposed FM are:
1) FMs allow parameter estimation under very sparse data where SVMs fail.
2) FMs have linear complexity, can be optimized in the primal and do not rely on support vectors like SVMs. We show that FMs scale to large datasets like Netflix with 100 million training instances.
3) FMs are a general predictor that can work with any real valued feature vector. In contrast to this, other state-of-the-art factorization models work only on very restricted input data. We will show that just by defining the feature vectors of the input data, FMs can mimic state-of-the-art models like biased MF, SVD++, PITF or FPMC.
II. PREDICTION UNDER SPARSITY
The most common prediction task is to estimate a function $\hat{y} : \mathbb{R}^n \to T$ from a real valued feature vector $\mathbf{x} \in \mathbb{R}^n$ to a target domain $T$ (e.g. $T = \mathbb{R}$ for regression or $T = \{+, -\}$ for classification).
In this paper, we deal with problems where $\mathbf{x}$ is highly sparse, i.e. almost all of the elements $x_i$ of a vector $\mathbf{x}$ are zero.
Example 1: Assume we have the transaction data of a movie review system. The system records which user $u \in U$ rates which movie (item) $i \in I$ at which time $t$ with which rating $r \in \{1, 2, 3, 4, 5\}$.
Let the observed data be the set $S$ of such (user, movie, time, rating) tuples.
An example of a prediction task using this data is to estimate a function $\hat{y}$ that predicts the rating a user would give a movie at a certain point in time.
Figure 1 shows one example of how feature vectors can be created from $S$ for this task.
We will use this example data throughout the paper for illustration. However please note that FMs are general predictors like SVMs and thus are applicable to any real valued feature vectors and are not restricted to recommender systems.
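To make the indicator encoding concrete, here is a minimal sketch (Python/NumPy, with made-up user and movie names; the helper `build_feature_vector` is illustrative, not from the paper) of how a single (user, movie) transaction becomes a sparse feature vector with a one-hot user group and a one-hot movie group, in the spirit of figure 1:

```python
import numpy as np

# Hypothetical vocabularies; in the paper's Figure 1 the users and the movies
# form the first two (one-hot) groups of the feature vector.
users = ["Alice", "Bob", "Charlie"]
movies = ["Titanic", "Notting Hill", "Star Wars", "Star Trek"]

def build_feature_vector(user, movie):
    """One-hot encode a (user, movie) transaction into one feature vector x."""
    x = np.zeros(len(users) + len(movies))
    x[users.index(user)] = 1.0                  # user indicator group
    x[len(users) + movies.index(movie)] = 1.0   # movie indicator group
    return x

# Example: one transaction -> one row of the design matrix.
x = build_feature_vector("Alice", "Titanic")
print(x)  # [1. 0. 0. 1. 0. 0. 0.]
```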
III. FACTORIZATION MACHINES (FM)
In this section, we introduce factorization machines. We discuss the model equation in detail and briefly show how to apply FMs to several prediction tasks.
A. Factorization Machine Model
1) Model Equation: The model equation for a factorization machine of degree $d = 2$ is defined as:

$$\hat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j \qquad (1)$$

where the model parameters that have to be estimated are:

$$w_0 \in \mathbb{R}, \qquad \mathbf{w} \in \mathbb{R}^n, \qquad \mathbf{V} \in \mathbb{R}^{n \times k}$$

and $\langle \cdot, \cdot \rangle$ is the dot product of two vectors of size $k$: $\langle \mathbf{v}_i, \mathbf{v}_j \rangle := \sum_{f=1}^{k} v_{i,f}\, v_{j,f}$.

A row $\mathbf{v}_i$ within $\mathbf{V}$ describes the $i$-th variable with $k$ factors, where $k \in \mathbb{N}_0^{+}$ is a hyperparameter that defines the dimensionality of the factorization.

A 2-way FM (degree $d = 2$) captures all single and pairwise interactions between variables:
- $w_0$ is the global bias.
- $w_i$ models the strength of the $i$-th variable.
- $\hat{w}_{i,j} := \langle \mathbf{v}_i, \mathbf{v}_j \rangle$ models the interaction between the $i$-th and $j$-th variable. Instead of using an own model parameter $w_{i,j} \in \mathbb{R}$ for each interaction, the FM models the interaction by factorizing it. We will see later on that this is the key point which allows high quality parameter estimates of higher-order interactions ($d \geq 2$) under sparsity.
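As a sanity check of eq. (1), the following sketch evaluates it naively by looping over all pairs; the function name and the random toy parameters are illustrative, not from the paper:

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """2-way FM model equation (eq. (1)), computed naively in O(k * n^2).

    x : (n,) feature vector, w0 : scalar bias, w : (n,) linear weights,
    V : (n, k) factor matrix with row V[i] = v_i.
    """
    n = len(x)
    y = w0 + np.dot(w, x)
    for i in range(n):
        for j in range(i + 1, n):
            y += np.dot(V[i], V[j]) * x[i] * x[j]   # <v_i, v_j> * x_i * x_j
    return y

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
n, k = 7, 4
x = np.array([1.0, 0, 0, 1.0, 0, 0, 0])            # e.g. (Alice, Titanic)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(scale=0.1, size=(n, k))
print(fm_predict_naive(x, w0, w, V))
```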
2) Expressiveness: It is well known that for any positive definite matrix $\mathbf{W}$ there exists a matrix $\mathbf{V}$ such that $\mathbf{W} = \mathbf{V}\mathbf{V}^\top$, provided that $k$ is sufficiently large. This shows that a FM can express any interaction matrix $\mathbf{W}$ if $k$ is chosen large enough. Nevertheless, in sparse settings a small $k$ should typically be chosen because there is not enough data to estimate complex interactions; restricting $k$ (and thus the expressiveness of the FM) leads to better generalization under sparsity.
3) Parameter Estimation Under Sparsity: In sparse settings, there is usually not enough data to estimate interactions between variables directly and independently. Factorization machines can estimate interactions well even in these settings because they break the independence of the interaction parameters by factorizing them. In general this means that the data for one interaction also helps to estimate the parameters for related interactions. We will make the idea more clear with an example from the data in figure 1. Assume we want to estimate the interaction between Alice (A) and Star Trek (ST) for predicting the target $y$ (here the rating). There is no training case where both $x_A$ and $x_{ST}$ are non-zero, so a direct estimate would yield no interaction ($w_{A,ST} = 0$). With the factorized parameters $\langle \mathbf{v}_A, \mathbf{v}_{ST} \rangle$, however, the interaction can still be estimated, because the factor vectors of Alice and of Star Trek are learned from the interactions each of them has with other movies and users in the data.
4) Computation: Next, we show how to make FMs applicable from a computational point of view. The complexity of straightforward computation of eq. (1) is in $O(k\,n^2)$ because all pairwise interactions have to be computed. But with a reformulation it drops to linear runtime.

Lemma 3.1: The model equation of a factorization machine (eq. (1)) can be computed in linear time $O(k\,n)$.

Proof: Due to the factorization of the pairwise interactions, there is no model parameter that directly depends on two variables (e.g. a parameter with an index $(i, j)$). So the pairwise interactions can be reformulated:

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^{2} - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2\right]$$

This equation has only linear complexity in both $k$ and $n$, i.e. its computation is in $O(k\,n)$.

Moreover, under sparsity most of the elements in $\mathbf{x}$ are 0, so the sums only have to be computed over the non-zero elements. Thus in sparse applications, the computation of the factorization machine is in $O(k\,\overline{m}_D)$, where $\overline{m}_D$ is the average number of non-zero elements per feature vector (e.g. $\overline{m}_D = 2$ in the pure matrix-factorization case).
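A minimal sketch of the linear-time evaluation implied by Lemma 3.1, using the reformulated pairwise sum; the function name and the toy parameters are illustrative:

```python
import numpy as np

def fm_predict_linear(x, w0, w, V):
    """FM model equation using the O(k * n) reformulation of Lemma 3.1:

    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
    """
    linear = w0 + np.dot(w, x)
    vx = V.T @ x                      # (k,)  sum_i v_{i,f} x_i  for each factor f
    v2x2 = (V ** 2).T @ (x ** 2)      # (k,)  sum_i v_{i,f}^2 x_i^2
    pairwise = 0.5 * np.sum(vx ** 2 - v2x2)
    return linear + pairwise

# Same toy example as above; agrees with the naive O(k * n^2) computation.
rng = np.random.default_rng(0)
n, k = 7, 4
x = np.array([1.0, 0, 0, 1.0, 0, 0, 0])
w0, w, V = 0.1, rng.normal(size=n), rng.normal(scale=0.1, size=(n, k))
print(fm_predict_linear(x, w0, w, V))
```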
B. Factorization Machines as Predictors
FMs can be applied to a variety of prediction tasks. Among them are:
- Regression: $\hat{y}(\mathbf{x})$ can be used directly as the predictor and the optimization criterion is e.g. the minimal least square error on $D$.
- Binary classification: the sign of $\hat{y}(\mathbf{x})$ is used and the parameters are optimized for hinge loss or logit loss.
- Ranking: the vectors $\mathbf{x}$ are ordered by the score of $\hat{y}(\mathbf{x})$ and optimization is done over pairs of instance vectors $(\mathbf{x}^{(a)}, \mathbf{x}^{(b)}) \in D$ with a pairwise classification loss (e.g. like in [5]).

In all these cases, regularization terms like L2 are usually added to the optimization objective to prevent overfitting.
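The following sketch illustrates how the same FM score $\hat{y}(\mathbf{x})$ plugs into these three criteria, assuming a squared loss for regression, a logit loss for classification, and a logit loss on score differences for ranking; all function names are illustrative:

```python
import numpy as np

# `y_hat`, `y_hat_a`, `y_hat_b` stand for FM outputs on feature vectors.

def squared_loss(y_hat, y):                 # regression
    return 0.5 * (y_hat - y) ** 2

def logit_loss(y_hat, y):                   # binary classification, y in {-1, +1}
    return np.log1p(np.exp(-y * y_hat))

def pairwise_ranking_loss(y_hat_a, y_hat_b):
    # x_a should be ranked above x_b: classify the score difference.
    return np.log1p(np.exp(-(y_hat_a - y_hat_b)))
```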
C. Learning Factorization Machines
As we have shown, FMs have a closed model equation that can be computed in linear time. Thus, the model parameters ($w_0$, $\mathbf{w}$ and $\mathbf{V}$) can be learned efficiently by gradient descent methods, e.g. stochastic gradient descent (SGD), for a variety of losses, among them square, logit or hinge loss. The gradient of the FM model is:

$$\frac{\partial}{\partial \theta}\hat{y}(\mathbf{x}) = \begin{cases} 1, & \text{if } \theta \text{ is } w_0 \\ x_i, & \text{if } \theta \text{ is } w_i \\ x_i \sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^2, & \text{if } \theta \text{ is } v_{i,f} \end{cases}$$

The sum $\sum_{j=1}^{n} v_{j,f}\, x_j$ is independent of $i$ and can therefore be precomputed (e.g. when computing $\hat{y}(\mathbf{x})$). Each gradient can then be computed in constant time, and all parameter updates for a case $(\mathbf{x}, y)$ can be done in $O(k\,n)$, or $O(k\, m(\mathbf{x}))$ under sparsity.
We provide a generic implementation, LIBFM, that uses SGD and supports both element-wise and pairwise losses.
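A minimal SGD sketch for a 2-way FM with squared loss and L2 regularization, using the gradients above; the learning rate, the regularization constant and the function name are illustrative choices, not the paper's:

```python
import numpy as np

def sgd_update(x, y, w0, w, V, lr=0.01, reg=0.01):
    """One SGD step for a 2-way FM with squared loss (a minimal sketch).

    Gradients of eq. (1):
      d y_hat / d w0      = 1
      d y_hat / d w_i     = x_i
      d y_hat / d v_{i,f} = x_i * (sum_j v_{j,f} x_j) - v_{i,f} * x_i^2
    """
    vx = V.T @ x                                   # precompute sum_j v_{j,f} x_j
    y_hat = w0 + np.dot(w, x) + 0.5 * np.sum(vx ** 2 - (V ** 2).T @ (x ** 2))
    err = y_hat - y                                # d loss / d y_hat for squared loss

    w0 -= lr * err
    w -= lr * (err * x + reg * w)
    grad_V = err * (np.outer(x, vx) - V * (x ** 2)[:, None])
    V -= lr * (grad_V + reg * V)
    return w0, w, V
```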
D. d-way Factorization Machine
The 2-way FM described so far can easily be generalized to a d-way FM:

$$\hat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{l=2}^{d} \sum_{i_1=1}^{n} \cdots \sum_{i_l = i_{l-1}+1}^{n} \left(\prod_{j=1}^{l} x_{i_j}\right)\left(\sum_{f=1}^{k_l} \prod_{j=1}^{l} v^{(l)}_{i_j, f}\right) \qquad (5)$$

where the interaction parameters for the $l$-th interaction are factorized by the PARAFAC model with the model parameters $\mathbf{V}^{(l)} \in \mathbb{R}^{n \times k_l}$, $k_l \in \mathbb{N}_0^{+}$.

The straight-forward complexity for computing eq. (5) is $O(k_d\, n^d)$. But with a reformulation analogous to lemma 3.1, it can also be computed in linear time.
E. Summary
FMs model all possible interactions between values in the feature vector $\mathbf{x}$ using factorized interactions instead of fully parametrized ones. This has two main advantages:
- The interactions between values can be estimated even under high sparsity. In particular, it is possible to generalize to unobserved interactions.
- The number of parameters as well as the time for prediction and learning is linear. This makes direct optimization using SGD feasible and allows optimizing against a variety of loss functions.
In the remainder of this paper, we will show the relationships between factorization machines and support vector machines as well as matrix, tensor and specialized factorization models.
IV. FMs VS. SVMs
A. SVM model
The model equation of an SVM [6] can be expressed as the dot product between the transformed input $\mathbf{x}$ and model parameters $\mathbf{w}$: $\hat{y}(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle$, where $\phi$ is a mapping from the feature space $\mathbb{R}^n$ into a more complex space $\mathcal{F}$. The mapping $\phi$ is related to the kernel by $K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$.
In the following, we discuss the relationships of FMs and SVMs by analyzing the primal form of the SVMs.
In practice, SVMs are solved in the dual form and the mapping φ is not performed explicitly. Nevertheless, the primal and dual have the same solution (optimum), so all our arguments about the primal hold also for the dual form.
1) Linear kernel: The most simple kernel is the linear kernel $K_l(\mathbf{x}, \mathbf{z}) := 1 + \langle \mathbf{x}, \mathbf{z} \rangle$, which corresponds to the mapping $\phi(\mathbf{x}) := (1, x_1, \ldots, x_n)$. The model equation of a linear SVM can then be rewritten as:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i, \qquad w_0 \in \mathbb{R},\ \mathbf{w} \in \mathbb{R}^n \qquad (7)$$
It is obvious that a linear SVM (eq. (7)) is identical to a FM of degree d = 1 (eq. (5)).
2) Polynomial kernel: The polynomial kernel allows the SVM to model higher interactions between variables. It is defined as $K(\mathbf{x}, \mathbf{z}) := (\langle \mathbf{x}, \mathbf{z} \rangle + 1)^d$. For $d = 2$, the corresponding mapping contains a constant, all linear terms (scaled by $\sqrt{2}$), all squares $x_i^2$ and all pairwise products $\sqrt{2}\, x_i x_j$.

And so, the model equation for polynomial SVMs ($d = 2$) can be rewritten as:

$$\hat{y}(\mathbf{x}) = w_0 + \sqrt{2}\sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} w^{(2)}_{i,i}\, x_i^2 + \sqrt{2}\sum_{i=1}^{n}\sum_{j=i+1}^{n} w^{(2)}_{i,j}\, x_i x_j \qquad (9)$$

where the model parameters are $w_0 \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^n$ and the symmetric matrix $\mathbf{W}^{(2)} \in \mathbb{R}^{n \times n}$.

Comparing a polynomial SVM (eq. (9)) to a FM (eq. (1)), one can see that both model all nested interactions up to degree $d = 2$. The main difference is the parametrization: all interaction parameters $w^{(2)}_{i,j}$ of the SVM are completely independent, whereas the interaction parameters of the FM are factorized, so $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ and $\langle \mathbf{v}_i, \mathbf{v}_l \rangle$ depend on each other because they share the vector $\mathbf{v}_i$.
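One practical consequence of the two parametrizations is the number of independent pairwise parameters; a small back-of-the-envelope comparison with made-up sizes $n$ and $k$:

```python
# Parameter count for the pairwise part (illustrative numbers):
n, k = 100_000, 10          # e.g. number of users + items, factorization dim

# upper triangle (incl. diagonal) of the symmetric matrix W^(2) of a polynomial SVM
dense_pairwise = n * (n + 1) // 2
# rows v_i of the FM factor matrix V, shared across all pairs
factorized_pairwise = n * k

print(dense_pairwise)       # 5000050000 -- one independent parameter per pair
print(factorized_pairwise)  # 1000000    -- grows only linearly in n
```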
B. Parameter Estimation Under Sparsity
In the following, we will show why linear and polynomial SVMs fail for very sparse problems. We show this for the example of collaborative filtering with user and item indicator variables (see the first two groups (blue and red) in the example of figure 1). Here, the feature vectors are sparse and only two elements are non-zero (the active user $u$ and the active item $i$).
1) Linear SVM: For this kind of data $\mathbf{x}$, the linear SVM model (eq. (7)) is equivalent to:

$$\hat{y}(\mathbf{x}) = w_0 + w_u + w_i$$

because $x_j = 1$ only for $j = u$ and $j = i$. This corresponds to one of the most basic collaborative filtering models, which captures only the user and item biases. As this model is very simple, the parameters can be estimated well even under sparsity; however, its empirical prediction quality is typically low.
2) Polynomial SVM: With the polynomial kernel, the SVM can capture higher-order interactions (here between users and items). In our sparse case with $m(\mathbf{x}) = 2$, the polynomial SVM model (eq. (9)) is equivalent to:

$$\hat{y}(\mathbf{x}) = w_0 + \sqrt{2}(w_u + w_i) + w^{(2)}_{u,u} + w^{(2)}_{i,i} + \sqrt{2}\, w^{(2)}_{u,i}$$

First of all, $w_u$ and $w^{(2)}_{u,u}$ express the same effect (and likewise $w_i$ and $w^{(2)}_{i,i}$), so one of each pair can be dropped. The model equation is then the same as in the linear case plus one additional user-item interaction parameter $w^{(2)}_{u,i}$. In typical collaborative filtering settings there is at most one observation for each pair $(u, i)$ in the training data, and none at all for the pairs that appear only in the test data, so these interaction parameters cannot be estimated reliably and the maximum-margin solution leaves them at 0. Hence the polynomial SVM cannot make use of any two-way interaction and does no better than a linear SVM.

For SVMs, estimating higher-order interactions is not only an issue in CF but in all scenarios where the data is hugely sparse: for a reliable estimate of the interaction parameter $w^{(2)}_{i,j}$ there must be "enough" cases $\mathbf{x} \in D$ where both $x_i \neq 0$ and $x_j \neq 0$. As soon as either $x_i = 0$ or $x_j = 0$, the case $\mathbf{x}$ cannot be used to estimate $w^{(2)}_{i,j}$, and in sparse data there are too few or even no such cases.
C. Summary
- The dense parametrization of SVMs requires direct observations for the interactions which is often not given in sparse settings. Parameters of FMs can be estimated well even under sparsity (see section III-A3).
- FMs can be directly learned in the primal. Non-linear SVMs are usually learned in the dual.
- The model equation of FMs is independent of the training data. Prediction with SVMs depends on parts of the training data (the support vectors).
V. FMs VS. OTHER FACTORIZATION MODELS
There is a variety of factorization models, ranging from standard models for m-ary relations over categorical variables (e.g. MF, PARAFAC) to specialized models for specific data and tasks (e.g. SVD++, PITF, FPMC). Next, we show that FMs can mimic many of these models just by using the right input data (e.g. the feature vector $\mathbf{x}$).
A. Matrix and Tensor Factorization
Matrix factorization (MF) is one of the most studied factorization models (e.g. [7], [8], [2]). It factorizes a relationship between two categorical variables (e.g. users $U$ and items $I$). The standard approach to map this into a feature vector $\mathbf{x}$ is to use binary indicator variables for the active user and the active item (see the first two groups of figure 1).
To shorten notation, we address elements in $\mathbf{x}$ (e.g. $x_j$) and the parameters both by numbers (e.g. $j \in \{1, \ldots, n\}$) and by categorical levels (e.g. $j \in (U \cup I)$). That means we implicitly assume a bijective mapping from numbers to categorical levels.
A FM using this feature vector $\mathbf{x}$ is identical to the matrix factorization model, because $x_j$ is non-zero only for $u$ and $i$, so all other biases and interactions drop out:

$$\hat{y}(\mathbf{x}) = w_0 + w_u + w_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle$$

This is exactly the matrix factorization model with user and item biases (biased MF).
With the same argument, one can see that for problems with more than two categorical variables, FMs include a nested parallel factor analysis model (PARAFAC).
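A small sketch (with random, illustrative parameters) verifying that evaluating the FM on a feature vector with one active user indicator and one active item indicator yields exactly the biased MF score $w_0 + w_u + w_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 3, 4, 5
n = n_users + n_items
w0, w, V = 0.1, rng.normal(size=n), rng.normal(scale=0.1, size=(n, k))

def fm_predict(x):
    vx = V.T @ x
    return w0 + w @ x + 0.5 * np.sum(vx ** 2 - (V ** 2).T @ (x ** 2))

u, i = 1, 2                      # a user index and an item index
x = np.zeros(n)
x[u] = 1.0                       # user indicator block
x[n_users + i] = 1.0             # item indicator block

biased_mf = w0 + w[u] + w[n_users + i] + V[u] @ V[n_users + i]
assert np.isclose(fm_predict(x), biased_mf)   # FM on this x == biased MF
```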
B. SVD++
For the task of rating prediction (i.e. regression), Koren improves the matrix factorization model to the SVD++ model [2]. A FM can mimic this model by using the following input data $\mathbf{x}$ (like in the first three groups of figure 1): binary indicators for the active user $u$ and the active item $i$, plus a normalized indicator for every other movie $l \in N_u$, where $N_u$ is the set of all movies the user has ever rated.
To distinguish elements in $N_u$ from elements in $I$, they are transformed with any bijective function $\omega : I \to L$ into a space $L$ with $L \cap I = \emptyset$.
Expanding the FM model equation on this input, the first part is exactly the same as the SVD++ model. But the FM also contains some additional interactions between users and the movies in $N_u$, as well as basic effects for the movies in $N_u$ and interactions between pairs of movies in $N_u$.
C. PITF for Tag Recommendation
The problem of tag prediction is defined as ranking tags for a given user and item combination. That means there are three categorical domains involved: users $U$, items $I$ and tags $T$. With binary indicator variables for the active user $u$, item $i$ and tag $t$, the FM model is:

$$\hat{y}(\mathbf{x}) = w_0 + w_u + w_i + w_t + \langle \mathbf{v}_u, \mathbf{v}_i \rangle + \langle \mathbf{v}_u, \mathbf{v}_t \rangle + \langle \mathbf{v}_i, \mathbf{v}_t \rangle$$
As this model is used for ranking between two tags $t_A$, $t_B$ within the same user/item combination $(u, i)$, both the optimization and the prediction work only on differences between the scores for $(u, i, t_A)$ and $(u, i, t_B)$. With optimization for pairwise ranking, all terms that are independent of the tag therefore vanish and the FM model is equivalent to:

$$\hat{y}(\mathbf{x}) := w_t + \langle \mathbf{v}_u, \mathbf{v}_t \rangle + \langle \mathbf{v}_i, \mathbf{v}_t \rangle \qquad (14)$$
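To see why the tag-independent terms drop out, consider the score difference between two tags for the same $(u, i)$ (a short derivation from the full FM equation above):

$$\hat{y}(u,i,t_A) - \hat{y}(u,i,t_B) = (w_{t_A} - w_{t_B}) + \langle \mathbf{v}_u, \mathbf{v}_{t_A} - \mathbf{v}_{t_B} \rangle + \langle \mathbf{v}_i, \mathbf{v}_{t_A} - \mathbf{v}_{t_B} \rangle$$

so $w_0$, $w_u$, $w_i$ and $\langle \mathbf{v}_u, \mathbf{v}_i \rangle$ cancel and only the tag-dependent terms of eq. (14) remain.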
Now the original PITF model and the FM model with binary indicators (eq. (14)) are almost identical. The only differences are that (i) the FM model has a bias term $w_t$ for tags and (ii) the factorization parameters for the tags ($\mathbf{v}_t$) are shared between the (user, tag) and (item, tag) interactions in the FM model, but are separate in the original PITF model.
D. Factorized Personalized Markov Chains (FPMC)
The FPMC model [4] tries to rank products in an online shop based on the last purchases (at time $t - 1$) of the user $u$.
Again, just by feature generation, a factorization machine ($d = 2$) behaves similarly: binary indicator variables are used for the active user $u$ and the candidate item $i$, and normalized indicator variables for the items in the user's last basket $B^u_{t-1}$, where $B^u_{t-1}$ is the set ("basket") of all items the user $u$ purchased at time $t-1$.

Like for tag recommendation, this model is used and optimized for ranking (here ranking items $i$). Thus only score differences between cases with the same user and basket are used in prediction and optimization, so all additive terms that do not depend on $i$ vanish, and the FM model equation reduces to the item bias $w_i$, the user-item interaction $\langle \mathbf{v}_u, \mathbf{v}_i \rangle$ and the (normalized) interactions between the candidate item and the items of the last basket.
Now one can see that the original FPMC model and the FM model are almost identical and differ only in the additional item bias $w_i$ and in the FM's sharing of the item factorization parameters between the user-item interaction and the basket-item interactions.
E. Summary
- Standard factorization models like PARAFAC or MF are not general prediction models like factorization machines. Instead they require that the feature vector is partitioned in m parts and that in each part exactly one element is 1 and the rest 0.
- There are many proposals for specialized factorization models designed for a single task. We have shown that factorization machines can mimic many of the most successful factorization models (including MF, PARAFAC, SVD++, PITF, FPMC) just by feature extraction, which makes FMs easily applicable in practice.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have introduced factorization machines. FMs bring together the generality of SVMs with the benefits of factorization models. In contrast to SVMs, (1) FMs are able to estimate parameters under huge sparsity, (2) the model equation is linear and depends only on the model parameters and thus (3) they can be optimized directly in the primal. The expressiveness of FMs is comparable to the one of polynomial SVMs. In contrast to tensor factorization models like PARAFAC, FMs are a general predictor that can handle any real valued vector. Moreover, simply by using the right indicators in the input feature vector, FMs are identical or very similar to many of the specialized state-of-the-art models that are applicable only for a specific task, among them are biased MF, SVD++, PITF and FPMC.