Paper reading: Part-based Graph Convolutional Network for Action Recognition
Graph and skeleton:
Human skeleton is intuitively represented as a sparse graph with joints as nodes and natural connections between them as edges.
- nodes: joints
- edges: natural connections between joints
Conventional action recognition from S-videos (skeletal videos):
- the whole skeleton is treated as a single graph
- 3D joint coordinates are used as the input
Two kinds of information used by the model in this paper:
- Geometric features: such as relative joint coordinates
- Motion features: such as temporal displacements
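A minimal sketch of the two feature types, assuming a skeleton sequence is stored as a list of frames, each a list of (x, y, z) joint coordinates. The reference joint `root` and the use of consecutive-frame differences are illustrative assumptions, not necessarily the exact definitions used in the paper.

```python
def relative_coordinates(frame, root=0):
    """Geometric feature: each joint's position relative to a reference joint.

    `root` (index of the reference joint) is a hypothetical choice.
    """
    rx, ry, rz = frame[root]
    return [(x - rx, y - ry, z - rz) for (x, y, z) in frame]


def temporal_displacements(frames):
    """Motion feature: per-joint displacement between consecutive frames.

    The first frame has no predecessor, so its displacement is set to zero.
    """
    zeros = [(0.0, 0.0, 0.0)] * len(frames[0])
    out = [zeros]
    for prev, cur in zip(frames, frames[1:]):
        out.append([(cx - px, cy - py, cz - pz)
                    for (px, py, pz), (cx, cy, cz) in zip(prev, cur)])
    return out


# Toy sequence: 2 frames, 2 joints.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 2.0, 0.0)],
]
geometric = [relative_coordinates(f) for f in frames]
motion = temporal_displacements(frames)
```

Both outputs keep the (frame, joint, 3) layout of the input, so they could be concatenated per vertex before being fed to the network.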
Main contributions of the paper:
- Formulation of a general part-based graph convolutional network (PB-GCN).
- Use of geometric and motion features in place of 3D joint locations at each vertex. That is, geometric information (relative joint coordinates) and motion information (temporal displacements) are used.
- Exceeding the state-of-the-art on the challenging benchmark datasets NTURGB+D and HDM05.
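Convolution formula for a single graph (no partitioning):
k-neighborhood
$$
Y(v_i) = \sum_{v_j \in N_k(v_i)} W(L(v_j)) X(v_j)
$$
- $W(\cdot)$: a filter weight vector of size $L$, indexed by the label $L(v_j)$ assigned to neighbor $v_j$ in the $k$-neighborhood $N_k(v_i)$
- $X(v_j)$: the input feature at $v_j$
- $Y(v_i)$: convolved output feature at root vertex $v_i$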
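1-neighborhood
Expressing the neighborhood in a different form (through the adjacency matrix) and reducing the neighborhood size from $k$ to 1 gives the following formula:
$$
Y(v_i) = \sum_{j} A_{norm}(i, j) W(L(v_j)) X(v_j)
$$
- $A_{norm}$: the normalized adjacency matrix of the graph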
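The adjacency-matrix form can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a single label (one weight matrix shared by all neighbors), and a normalization with self-loops and symmetric degree scaling borrowed from common GCN practice, which these notes do not confirm as the paper's choice. The helper names are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (an assumed choice)."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def graph_conv(adj_norm, features, weight):
    """Y = A_norm X W: transform each vertex feature, then sum over neighbors."""
    return matmul(adj_norm, matmul(features, weight))


# Toy chain graph 0 - 1 - 2 with 2-dim input features and 1 output channel.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0], [1.0]]  # hypothetical weights shared by all neighbors
y = graph_conv(normalize_adjacency(adj), x, w)
```

Because the neighborhood is encoded in the matrix, the sum over neighbors becomes a single matrix product, which is why this form is convenient in practice.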
Part-based Graph
In general, a part-based graph can be constructed as a combination of subgraphs where each subgraph has certain properties that define it.
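Definition of graph partitioning:
We consider scenarios in which the partitions can share vertices or have edges connecting them.
That is, a graph is divided into subgraphs, and different subgraphs may share vertices or have edges connecting them.
- $P_p$ is the $p$-th partition (or subgraph) of the graph $G$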
two parts (b):
- Axial skeleton
- Appendicular skeleton
four parts (c) (recommended):
- head
- hands
- torso
- legs
We consider left and right parts of hands and legs together in order to be agnostic to laterality [31] (handedness / footedness) of the human when performing an action.
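That is, the effect of laterality is removed (waving with the left hand and waving with the right hand are both waving).
six parts (d):
We divide the upper and lower components of appendicular skeleton into left and right (shown in Figure 1(d)), resulting in six parts.
Connecting the subgraphs:
Subgraphs can be connected in two ways: through shared vertices or through connecting edges. Shared vertices are used here.
To cover all natural connections between joints in skeleton graph, we include an overlap of at least one joint between two adjacent parts.
That is, any two adjacent subgraphs have at least one node in common.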
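The overlap requirement can be checked on a toy example. The 7-joint skeleton, its joint indices, and the part assignment below are hypothetical and much smaller than the real skeleton; the sketch only shows that sharing one joint between adjacent parts is enough for the parts to cover every edge.

```python
# Hypothetical toy skeleton (indices and names are illustrative only).
#   0: head, 1: neck, 2/3: left/right hand, 4: hip, 5/6: left/right foot
EDGES = {(0, 1), (1, 2), (1, 3), (1, 4), (4, 5), (4, 6)}

PARTS = {
    "head": {0, 1},
    "hands": {1, 2, 3},  # left and right kept together
    "torso": {1, 4},
    "legs": {4, 5, 6},   # left and right kept together
}


def part_edges(vertices, edges):
    """Edges of the subgraph induced by a part's vertex set."""
    return {(u, v) for (u, v) in edges if u in vertices and v in vertices}


def shared_joints(parts):
    """Joints that belong to more than one part, keyed by the pair of parts."""
    names = sorted(parts)
    return {
        (a, b): parts[a] & parts[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if parts[a] & parts[b]
    }


# Union of the edges seen by each part, and the overlaps between parts.
covered = set().union(*(part_edges(v, EDGES) for v in PARTS.values()))
overlaps = shared_joints(PARTS)
```

Here `covered` equals `EDGES`, so no natural connection is lost by the partitioning; removing joint 4 from the torso part would drop the neck-hip edge from every subgraph.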
Part-based Graph Convolutions
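Unlike the single-graph convolution formula above (Eq. 2), the graph has a new convolution formula once it is partitioned into subgraphs.
Several concepts also need to be redefined.
Neighborhoods:
- Spatial neighbor: the 1-neighborhood within a single frame, i.e. at a fixed time (Figure 3(a)).
- Temporal neighbor: the positions of a single node at different times (Figure 3(a)).
- Spatio-temporal neighbor: the union of the spatial and temporal neighborhoods (Figure 3(b)).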
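The three neighborhoods can be written out as sets of (joint, frame) pairs. This is a sketch under assumptions: the root vertex is counted as part of its own neighborhood, and the temporal window is taken as floor(tau / 2) frames on each side, clipped at the sequence boundaries. Function names are hypothetical.

```python
def spatial_neighbors(joint, edges):
    """1-neighborhood within a frame: the joint itself and adjacent joints."""
    nbrs = {joint}
    for u, v in edges:
        if u == joint:
            nbrs.add(v)
        elif v == joint:
            nbrs.add(u)
    return nbrs


def temporal_neighbors(t, num_frames, tau):
    """Frames within floor(tau / 2) steps of t, clipped to the sequence."""
    half = tau // 2
    return list(range(max(0, t - half), min(num_frames, t + half + 1)))


def spatio_temporal_neighbors(joint, t, edges, num_frames, tau):
    """Union of the spatial and temporal sets as (joint, frame) pairs."""
    spatial = {(j, t) for j in spatial_neighbors(joint, edges)}
    temporal = {(joint, s) for s in temporal_neighbors(t, num_frames, tau)}
    return spatial | temporal


edges = [(0, 1), (1, 2)]  # toy chain 0 - 1 - 2
nbrs = spatio_temporal_neighbors(joint=1, t=2, edges=edges,
                                 num_frames=5, tau=3)
```

For joint 1 at frame 2, the result contains its two spatial neighbors in the same frame plus the same joint in frames 1 and 3.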
Convolution:
Graph convolutions over a part identify the properties of that subgraph, and an aggregation across subgraphs learns the relations between them.
For a part-based graph, convolutions for each part are performed separately and the results are combined using an aggregation function.
That is, convolution is first performed within each subgraph (1-neighborhood), and an aggregation function then captures the relations between the subgraphs.
The formulas are as follows:
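Subgraph convolution:
$$
Y_p(v_i) = \sum_{v_j \in N_{kp}(v_i)} W_p(L(v_j)) X_p(v_j), \quad p \in \{1, \dots, n\}
$$
- $W_p$ can be shared across parts or kept separate, while the neighbors of $v_i$ only in that part ($N_{kp}(v_i)$) are considered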
Aggregation of the subgraph convolution results:
Edge-sharing form:
Vertex-sharing form:
Spatio-temporal Part-based Graph Convolutions
Steps of the convolution
The S-videos are represented as spatio-temporal graphs.
That is, an S-video is essentially a spatio-temporal graph.
We spatially convolve each partition independently for each frame, aggregate them at each frame, and perform temporal convolution on the temporal dimension of the aggregated graph.
That is, there are roughly two steps, or three in finer detail:
- Spatial convolution:
  - Subgraph convolution: spatially convolve each partition independently for each frame
  - Aggregation of the subgraph results: aggregate the results of the partition convolutions at each frame
- Temporal convolution:
  - Temporal convolution of the aggregated result: temporal convolution on the temporal dimension of the aggregated graph
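The steps above can be sketched end to end in plain Python. To stay dependency-free, each joint carries a single scalar feature, so the part-specific transform is one number per part. Averaging over the parts that contain a joint is an assumed aggregation function, and the zero-padded temporal window is likewise an assumption; all names are hypothetical.

```python
def spatial_conv_part(x_frame, part, adj_norm, w_p):
    """Per-part spatial convolution on one frame (one scalar per joint).

    part: joint indices in the part; adj_norm: normalized adjacency
    restricted to the part; w_p: part-specific weight.
    """
    z = [w_p * x_frame[v] for v in part]  # pointwise transform
    y = [sum(a * zj for a, zj in zip(row, z)) for row in adj_norm]
    return dict(zip(part, y))  # joint index -> output


def aggregate(part_outputs, num_joints):
    """Combine part outputs per joint (averaging is an assumed choice).

    Every joint must belong to at least one part.
    """
    out = []
    for v in range(num_joints):
        vals = [o[v] for o in part_outputs if v in o]
        out.append(sum(vals) / len(vals))
    return out


def temporal_conv(ys, kernel):
    """1-D convolution over frames for every joint, zero-padded at the ends."""
    half = len(kernel) // 2
    num_frames, num_joints = len(ys), len(ys[0])
    out = []
    for t in range(num_frames):
        row = []
        for v in range(num_joints):
            acc = 0.0
            for k, w in enumerate(kernel):
                s = t + k - half
                if 0 <= s < num_frames:
                    acc += w * ys[s][v]
            row.append(acc)
        out.append(row)
    return out


def pb_gcn_layer(x, parts, adjs, weights, kernel):
    """Spatial conv per part -> aggregate per frame -> temporal conv."""
    num_joints = len(x[0])
    ys = [aggregate([spatial_conv_part(f, p, a, w)
                     for p, a, w in zip(parts, adjs, weights)], num_joints)
          for f in x]
    return temporal_conv(ys, kernel)


# Toy input: 3 frames, 3 joints in a chain 0 - 1 - 2, two parts sharing joint 1.
x = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
parts = [[0, 1], [1, 2]]
adjs = [[[0.5, 0.5], [0.5, 0.5]]] * 2  # hypothetical normalized adjacency
out = pb_gcn_layer(x, parts, adjs, weights=[1.0, 1.0], kernel=[1.0, 1.0, 1.0])
```

Joint 1 is the only vertex that receives two part outputs, so it is the only place where the aggregation actually mixes information between the two subgraphs.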
Division of the neighborhoods
For each vertex, we use a 1-neighborhood ($k = 1$) for the spatial dimension ($N_{1p}$), as the skeleton graph is not very large, and a $\tau$-neighborhood ($k = \tau$) for the temporal dimension ($N_\tau$); $N_\tau$ is not part-specific.
The division into spatial and temporal neighborhoods is given by:
Assignment of labels
For ordering vertices in the receptive fields (or neighborhoods), we use a single label spatially ($L_S$) to weigh vertices in $N_{1p}$ of each vertex equally and $\tau$ labels temporally ($L_T$) to weigh vertices across frames in $N_\tau$ differently.
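That is, for a root node, all vertices in the spatial neighborhood get the same label (0), while vertices in the temporal neighborhood get different labels.
The formulas are as follows: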
The complete convolution formulas
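Spatial convolution on a subgraph
$$
Z_p(v_{it}) = W_p(L_S(v_{it})) X_p(v_{it})
$$
$$
Y_p(v_{it}) = \sum_{v_{jt} \in N_{1p}(v_{it})} A_p(i, j) Z_p(v_{jt})
$$
- $W_p$: part-specific channel transform kernel (pointwise operation)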
- for each part is same but is part-specific
- $Z_p$: output from applying $W_p$ on input features $X_p$ at each vertex
- $A_p$: normalized adjacency matrix for part $p$
- $W_T$: temporal convolution kernel
Aggregation of the subgraph spatial convolutions
$$
Y_S(v_{it}) = \mathcal{F}_{agg}(\{Y_1(v_{it}), \dots, Y_n(v_{it})\})
$$
- $Y_S$: output obtained after aggregating all partition graphs at one frame
Temporal convolution
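$$
Y_T(v_{it_a}) = \sum_{v_{it_b} \in N_\tau(v_{it_a})} W_T(L_T(v_{it_b})) Y_S(v_{it_b})
$$
- $Y_T$: output after applying temporal convolution on the output $Y_S$ of $\tau$ frames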