Machine Learning Tasks on Graphs
A graph is an interesting type of data. One might think we could make predictions and train models the same way as with “normal” data. Surprisingly, machine learning tasks are defined quite differently on graphs, and we can categorize them into 4 types: node classification, link prediction, learning over the whole graph, and community detection. In this article, we look closer at how these tasks are defined and why they differ so much from standard machine learning tasks.
Node Classification
Let’s imagine that we have a network of friendships between animals. Each animal can be a dog, a cat, or a duck, and we can describe it with additional features such as weight, height, and colour. A connection between two specific animals means that they like each other. Our goal, given this friendship network and the features, is to predict the missing animal types. This prediction task is known as node classification.
Let’s formalize the notation. Mathematically, we can define this animal friendship network as G = (V, E), where V is the set of nodes (animals) and E is the set of edges (friendship connections). Each node also has a corresponding feature vector xi (weight, height, and colour), where the subscript i means that the vector belongs to node Vi. The goal of the node classification task is to predict the label yi given the node Vi and its neighbours.
Now, how can we use machine learning models to predict these animal types? Most machine learning models rely on the assumption that data points are independent of each other (the i.i.d. assumption). Here this assumption fails because node labels (animal types) may depend on neighbouring nodes [1]. For example, a node close to a cluster of cats is more likely to be a cat itself. Similar nodes are typically closer together, a property known as homophily.
Because data independence no longer holds in this case, we can’t classify this task as supervised learning. Researchers often refer to it as semi-supervised learning, because we can use information from neighbouring nodes to make predictions about a given node [1].
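To make the homophily idea concrete, here is a minimal sketch of neighbourhood-based node classification. The graph, node names, labels, and the majority-vote rule are all invented for this illustration (they are not from [1]), and it assumes the `networkx` library is available:

```python
import networkx as nx
from collections import Counter

# Toy animal friendship graph; node names and labels are hypothetical.
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),  # a cluster of cats
    ("d", "e"),                          # two dogs
    ("c", "d"),                          # a bridge between the clusters
])
labels = {"a": "cat", "b": "cat", "d": "dog", "e": "dog"}  # "c" is unlabelled

def predict_by_neighbours(G, labels, node):
    """Predict a node's label as the majority label among its labelled neighbours."""
    votes = Counter(labels[n] for n in G.neighbors(node) if n in labels)
    return votes.most_common(1)[0][0] if votes else None

# "c" has neighbours a, b (cats) and d (dog), so the majority vote says "cat".
print(predict_by_neighbours(G, labels, "c"))  # -> cat
```

Real node classification models (e.g. graph neural networks) learn far richer rules than a majority vote, but the key point is the same: the prediction for one node uses information from its neighbours, which is exactly why the i.i.d. assumption breaks.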
Link Prediction
The objective of link prediction is to decide whether there is a connection between two nodes. In our earlier example with the animal friendship graph, this amounts to predicting whether two given animals are friends.
Similarly to node classification, we can leverage neighbourhood information to predict the link between two nodes. A popular group of approaches for link prediction is called heuristic methods [2]. They compute certain scores from the graph itself and convert them into the likelihood of a link between two nodes. Heuristic methods can be categorized by the maximum number of neighbourhood hops they require [2]. For example, Common Neighbours is a first-order heuristic, as it requires only the direct neighbourhood of a node to compute the score (and not the neighbours of the neighbours). In the image below, we can see the 1st-hop neighbourhoods of nodes V1 and V3.
Frogs V1 and V3 have 2 friends (neighbouring nodes) in common: V6 and V2. With this simple score, the Common Neighbours algorithm decides whether there is a link between the two nodes.
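The Common Neighbours score is simple enough to sketch in a few lines. The edge list below is a hypothetical stand-in for the figure’s graph (the article’s image is not reproduced here), chosen so that V1 and V3 share exactly the neighbours V2 and V6; it assumes `networkx` is available:

```python
import networkx as nx

# Hypothetical graph in which V1 and V3 share neighbours V2 and V6,
# and V5 is connected to V4 and V6, loosely following the article's figure.
G = nx.Graph()
G.add_edges_from([
    ("V1", "V2"), ("V1", "V6"),
    ("V3", "V2"), ("V3", "V6"),
    ("V4", "V5"), ("V5", "V6"),
])

def common_neighbours_score(G, u, v):
    """First-order heuristic: count the neighbours shared by u and v."""
    return len(set(G.neighbors(u)) & set(G.neighbors(v)))

print(common_neighbours_score(G, "V1", "V3"))  # V2 and V6 -> 2
```

In practice this raw count is turned into a link likelihood, e.g. by ranking all candidate node pairs by their score or by feeding the score into a classifier.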
Of course, there are more complex approaches, for example Resource Allocation (a 2nd-order heuristic), which uses information from 2nd-order hops [2]. In the case of node V5, RA would use V4 and V6 from the 1st-hop neighbourhood and V1 and V3 from the 2nd-hop neighbourhood. Other approaches can use even higher-order heuristics to generate graph features for link prediction.
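For reference, the standard Resource Allocation index scores a pair (u, v) as the sum of 1/degree(z) over their common neighbours z, which is why it needs 2nd-hop information (the degrees of the common neighbours). The sketch below uses the same hypothetical edge list as above and `networkx`’s built-in `resource_allocation_index`:

```python
import networkx as nx

# Same hypothetical graph as before: V1 and V3 share neighbours V2 and V6.
G = nx.Graph()
G.add_edges_from([
    ("V1", "V2"), ("V1", "V6"),
    ("V3", "V2"), ("V3", "V6"),
    ("V4", "V5"), ("V5", "V6"),
])

# RA(u, v) = sum over common neighbours z of 1 / degree(z).
# Here: degree(V2) = 2 and degree(V6) = 3, so RA(V1, V3) = 1/2 + 1/3.
u, v, ra = next(nx.resource_allocation_index(G, [("V1", "V3")]))
print(ra)
```

Intuitively, RA down-weights common neighbours that are connected to many other nodes: a shared friend who is friends with everyone carries less evidence for a link than a shared friend with few connections.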
Similarly to node classification, link prediction is also considered semi-supervised learning, as we use neighbourhood information to predict the link between two nodes.
Learning Over the Whole Graph: Classification, Regression, Clustering
Let’s change the example. Consider that we now have molecular data and our task is to predict whether a given molecule is toxic. The illustration below shows how this classification task can be designed using graph neural networks.
We can consider each molecule as a separate graph, where an atom is a node and a bond between atoms is an edge. This is an example of a classification task over the whole graph. The difference here is that we are given multiple instances of different graphs and train our model on them [1]. We learn over whole graphs instead of predicting specific components within a single graph, such as a node or an edge.
What is compelling about learning over multiple instances of graphs is that the data points are considered to be i.i.d. This means that this learning task on graphs is very similar, if not identical, to the classification, regression, and clustering tasks found in standard machine learning problems. That is great news for us, because we can reuse standard machine learning algorithms such as Random Forest, SVMs, or XGBoost.
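One simple way to reuse those standard algorithms is to summarize every graph as a fixed-length feature vector, so each graph becomes one i.i.d. row in a tabular dataset. The features below (node count, edge count, average degree, density) are a deliberately basic choice for illustration, not the features a graph neural network would learn; the sketch assumes `networkx` is available:

```python
import networkx as nx

def graph_features(G):
    """Summarize a whole graph as a fixed-length feature vector,
    so each graph becomes one i.i.d. row for a standard classifier."""
    n = G.number_of_nodes()
    m = G.number_of_edges()
    avg_deg = 2 * m / n if n else 0.0
    return [n, m, avg_deg, nx.density(G)]

# Two hypothetical "molecules": a triangle and a 4-node path.
mol_a = nx.cycle_graph(3)
mol_b = nx.path_graph(4)

X = [graph_features(g) for g in (mol_a, mol_b)]
print(X)  # one feature row per graph
# Each row (plus a toxicity label) could now be fed to
# Random Forest, an SVM, or XGBoost as usual.
```

Graph neural networks replace these hand-crafted summaries with learned graph-level embeddings, but the overall setup is the same: one vector per graph, then a standard supervised model on top.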
Community Detection
In short, community detection on graphs can be considered a clustering task over the nodes within a single graph. Let’s have a look at the example below, which shows a network of scientists who have co-authored at least one paper together.
Here, the task of community detection is to identify clusters of scientists working in different fields. Although this seems intuitive, communities are rather vaguely defined and differ in size and shape across datasets [4]. Communities can also overlap, making it even harder to tell them apart.
Intuitively, we can expect that nodes inside a community have many edges to other nodes in the same community, while a smaller number of nodes with fewer edges connect different communities. These are the theoretical foundations that most community detection algorithms build on. There are many types of community detection algorithms, but the most popular are methods based on spectral clustering, statistical inference, optimization, and dynamics [6].
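As one concrete example from the optimization family, the sketch below runs greedy modularity maximization on a tiny made-up co-authorship network: two dense groups joined by a single bridge edge. The graph is invented for this illustration, and the example assumes `networkx` is available:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical co-authorship network: two tight groups of three scientists,
# connected by a single cross-group collaboration.
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),  # group 1
    ("d", "e"), ("e", "f"), ("d", "f"),  # group 2
    ("c", "d"),                          # bridge edge between the groups
])

# Greedy modularity maximization merges nodes into communities
# as long as each merge increases the modularity score.
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```

On this graph the two triangles come out as the two communities, because the bridge edge alone contributes too little modularity to justify merging them. Real co-authorship networks are much messier, which is why the many algorithm families mentioned above exist.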
Summing Up
We’ve seen that there are 4 major types of machine learning tasks on graphs: node classification, link prediction, learning over the whole graph, and community detection. Most of these tasks are very different from the usual supervised/unsupervised learning, because nodes in a graph are interconnected and the data independence assumption fails. Researchers refer to such tasks as semi-supervised learning.
About Me
I am an MSc Artificial Intelligence student at the University of Amsterdam. In my spare time, you can find me fiddling with data or debugging my deep learning model (I swear it worked!). I also like hiking :)
Here are my other social media profiles, if you want to stay in touch with my latest articles and other useful content:
Translated from: https://towardsdatascience.com/machine-learning-tasks-on-graphs-7bc8f175119a