A brief scan of the paper "Spatial Transformer Networks"

Part.1: what problems does the method proposed in this paper solve?

 

When we use a convolutional network to extract features from image data, we hope the extracted features are accurate and non-redundant. For example, a rotated feature can be expressed as an affine transform of the original feature. If we can automatically revise a deformed feature back to its original form, the network may need fewer channels to express the features of the data, and may reach higher accuracy on the final task, because a rectified image is easier to classify (or to process in similar tasks). The method proposed in this paper inserts Spatial Transformer Layers into a convolutional network: each such layer accepts deformed features as input and outputs rectified features. The Spatial Transformer Layers are trained jointly with the task, so that after training they output features in the right orientation, position and scale, helping the convolutional network produce a better output.


 

Part.2: what does the Spatial Transformer Layer look like?

In overview, the Spatial Transformer Layer looks like this:

[Figure: overview of the spatial transformer module, reproduced from the paper]

U is the affine-deformed input feature map (this paper only discusses affine deformations), and V is the output feature map rectified by the Spatial Transformer Layer. The Spatial Transformer Layer is built from a localisation net and a grid generator. The localisation net is a stack of convolutional layers followed by a regression layer, which regresses a parameter vector θ for the grid generator. The grid generator uses the affine transform parameterised by θ to map the features U to V, so that V is in the right orientation, position and scale. This phase can be read as the operation of a sampler.
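The pipeline described above can be sketched in numpy. This is a minimal sketch under assumptions: the localisation net is skipped (θ is given directly), the sampler uses nearest-neighbour lookup for brevity, and the function names `affine_grid` and `sample_nearest` are illustrative, not from the paper:

```python
import numpy as np

def affine_grid(theta, H, W):
    """Grid generator: map each target pixel (x_t, y_t) of V to a source
    location (x_s, y_s) in U, using the 2x3 affine matrix theta.
    Coordinates are normalised to [-1, 1] as in the paper."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    tgt = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3), homogeneous
    return tgt @ theta.T                                 # (H, W, 2) source coords

def sample_nearest(U, grid):
    """Sampler: read U at the (rounded) source coordinates to build V."""
    H, W = U.shape
    cx = np.clip(np.round((grid[..., 0] + 1) * (W - 1) / 2), 0, W - 1).astype(int)
    cy = np.clip(np.round((grid[..., 1] + 1) * (H - 1) / 2), 0, H - 1).astype(int)
    return U[cy, cx]

U = np.arange(16, dtype=float).reshape(4, 4)
theta_id = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])  # identity transform
V = sample_nearest(U, affine_grid(theta_id, 4, 4))
# with the identity θ, V reproduces U exactly
```

In a real network, `theta` would come from the localisation net, and the sampler would be bilinear so that gradients can flow (see Part.3).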

 

Part.3: details of the sampler:

 

A: forward propagation

(x_i_s, y_i_s)^T = T_θ((x_i_t, y_i_t)) = A_θ · (x_i_t, y_i_t, 1)^T,  where A_θ = [[θ_11, θ_12, θ_13], [θ_21, θ_22, θ_23]]   (1)

Explanation in the 2-D feature case. When we want to get the values of the output V point-wise, we scan V: for each point v_i of V, expressed as (x_i_t, y_i_t), we find its source location through the T_θ given above. T_θ is just the form of an affine transform function. Scanning all the v_i, we get the value of the whole output V.
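For a single target point, formulation (1) is just a 2x3 matrix product. A toy numerical example (the particular θ below, a scale of 0.5 plus a translation, is made up for illustration):

```python
import numpy as np

# 2x3 affine parameters theta: scale by 0.5, then translate by (0.1, 0.2)
theta = np.array([[0.5, 0.0, 0.1],
                  [0.0, 0.5, 0.2]])

# one target point (x_i_t, y_i_t) in homogeneous normalised coordinates
target = np.array([1.0, 1.0, 1.0])

# source point (x_i_s, y_i_s) = A_theta @ (x_i_t, y_i_t, 1)^T -- formulation (1)
source = theta @ target  # x_s = 0.5*1 + 0.1 = 0.6, y_s = 0.5*1 + 0.2 = 0.7
```

The value of V at this target point is then read from U at `source`.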

 

B: optimisation of θ

We can use the gradient descent method to optimise θ. So if we can get the expression of d(Value_vi(θ))/d(θ), we solve this problem.

The function T_θ maps each point v_i to a point u_i of U, expressed as (x_i_s, y_i_s); Val(u_i) is the value of U at u_i. So when v_i is given, the mapped u_i is decided by θ, and the value of v_i is the same as that of u_i (u_i can be computed from formulation (1)):

Value_vi(θ) = Val(ui(θ))

So the differential formulation is as follows (v_i is given):

 

d(Value_vi(θ))/d(θ) = (d(Val(ui)) / d(ui)) · (d(ui(θ)) / d(θ))

 


 

d(ui(θ))/d(θ) can be derived from formulation (1), and d(Val(ui))/d(ui) can be calculated from the values of the points near u_i (in the paper, through the bilinear sampling kernel). So we can calculate the expression d(Value_vi(θ))/d(θ), and use it to optimise θ.
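The second factor can be checked numerically. The sketch below (illustrative; function names are mine) computes d(Val(ui))/d(ui) for bilinear sampling from the four neighbouring pixel values, and verifies it against a central finite difference:

```python
import numpy as np

def bilinear(U, x, y):
    """Value of U at continuous pixel coordinates (x, y) by bilinear interpolation."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return (U[y0, x0] * (1 - dx) * (1 - dy) + U[y0, x0 + 1] * dx * (1 - dy)
            + U[y0 + 1, x0] * (1 - dx) * dy + U[y0 + 1, x0 + 1] * dx * dy)

def grad_bilinear(U, x, y):
    """Analytic d Val(u)/d(x, y): weighted differences of the four neighbours."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    gx = (U[y0, x0 + 1] - U[y0, x0]) * (1 - dy) + (U[y0 + 1, x0 + 1] - U[y0 + 1, x0]) * dy
    gy = (U[y0 + 1, x0] - U[y0, x0]) * (1 - dx) + (U[y0 + 1, x0 + 1] - U[y0, x0 + 1]) * dx
    return np.array([gx, gy])

rng = np.random.default_rng(0)
U = rng.random((5, 5))
x, y, eps = 1.3, 2.6, 1e-6

# central finite-difference approximation of the same gradient
num = np.array([(bilinear(U, x + eps, y) - bilinear(U, x - eps, y)) / (2 * eps),
                (bilinear(U, x, y + eps) - bilinear(U, x, y - eps)) / (2 * eps)])
assert np.allclose(grad_bilinear(U, x, y), num, atol=1e-5)
```

Chaining this gradient with d(ui(θ))/d(θ) from formulation (1) gives the full d(Value_vi(θ))/d(θ) needed for gradient descent.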

 

Part.4: experiment

[Figure: experimental results, reproduced from the paper]