Image Similarity Siamese Network
Overview
In this kernel I run a simple test of using Siamese networks for image similarity on a slightly more complicated problem than standard MNIST (Fashion-MNIST). The idea is to take a randomly initialized network, train it on pairs of images, and use it to measure how similar two images are. Such a model should make tasks like visual search over a database of images much easier, since it reduces each comparison to a single similarity score between 0 and 1 instead of operating on raw 2D pixel arrays.
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
data_train = pd.read_csv('../input/fashion-mnist_train.csv')
# the first column is the class label, the remaining 784 columns are pixel values
X_full = data_train.iloc[:, 1:]
y_full = data_train.iloc[:, :1]
x_train, x_test, y_train, y_test = train_test_split(X_full, y_full, test_size = 0.3)
x_train = x_train.values.reshape(-1, 28, 28, 1).astype('float32') / 255.
x_test = x_test.values.reshape(-1, 28, 28, 1).astype('float32') / 255.
# flatten the label columns to 1D so the group and scatter indexing below behaves as expected
y_train = y_train.values.ravel().astype('int')
y_test = y_test.values.ravel().astype('int')
print('Training', x_train.shape, x_train.max())
print('Testing', x_test.shape, x_test.max())
# reorganize the images by class so we can sample matching/non-matching pairs
train_groups = [x_train[np.where(y_train == i)[0]] for i in np.unique(y_train)]
test_groups = [x_test[np.where(y_test == i)[0]] for i in np.unique(y_test)]
print('train groups:', [x.shape[0] for x in train_groups])
print('test groups:', [x.shape[0] for x in test_groups])
Batch Generation
Here the idea is to make usable batches for training the network. We need to create parallel inputs for the A and B images, where the output is the similarity score. We make the naive assumption that if two images are in the same group their similarity is 1, otherwise it is 0.
If we selected pairs completely at random, most pairs would end up in different groups (with 10 balanced classes only about 10% of random pairs match), so the generator builds half of each batch from matching pairs and half from non-matching pairs.
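As a quick sanity check on that 10% figure, here is a small standalone simulation (illustrative only, not part of the pipeline): draw two class labels uniformly from 10 balanced classes and count how often they agree.
rng = np.random.RandomState(2018)  # hypothetical seed, just for reproducibility
labels_a = rng.randint(0, 10, size=100000)
labels_b = rng.randint(0, 10, size=100000)
print('Random-pair match rate: %2.1f%%' % (100 * np.mean(labels_a == labels_b)))  # ~10%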
def gen_random_batch(in_groups, batch_halfsize = 8):
    out_img_a, out_img_b, out_score = [], [], []
    all_groups = list(range(len(in_groups)))
    for match_group in [True, False]:
        # pick a random group for each A image, then a random image from that group
        group_idx = np.random.choice(all_groups, size = batch_halfsize)
        out_img_a += [in_groups[c_idx][np.random.choice(range(in_groups[c_idx].shape[0]))] for c_idx in group_idx]
        if match_group:
            # B comes from the same group as A -> similarity 1
            b_group_idx = group_idx
            out_score += [1]*batch_halfsize
        else:
            # B comes from anything but the same group -> similarity 0
            non_group_idx = [np.random.choice([i for i in all_groups if i != c_idx]) for c_idx in group_idx]
            b_group_idx = non_group_idx
            out_score += [0]*batch_halfsize
        out_img_b += [in_groups[c_idx][np.random.choice(range(in_groups[c_idx].shape[0]))] for c_idx in b_group_idx]
    return np.stack(out_img_a, 0), np.stack(out_img_b, 0), np.stack(out_score, 0)
pv_a, pv_b, pv_sim = gen_random_batch(train_groups, 3)
fig, m_axs = plt.subplots(2, pv_a.shape[0], figsize = (12, 6))
for c_a, c_b, c_d, (ax1, ax2) in zip(pv_a, pv_b, pv_sim, m_axs.T):
    ax1.imshow(c_a[:, :, 0])
    ax1.set_title('Image A')
    ax1.axis('off')
    ax2.imshow(c_b[:, :, 0])
    ax2.set_title('Image B\n Similarity: %3.0f%%' % (100 * c_d))
    ax2.axis('off')
from keras.models import Model
from keras.layers import Input, Conv2D, BatchNormalization, MaxPool2D, Activation, Flatten, Dense, Dropout
img_in = Input(shape = x_train.shape[1:], name = 'FeatureNet_ImageInput')
n_layer = img_in
for i in range(2):
    # two conv blocks per stage; the 'linear' activation keeps Conv -> BN -> ReLU in that order
    n_layer = Conv2D(8 * 2**i, kernel_size = (3, 3), activation = 'linear')(n_layer)
    n_layer = BatchNormalization()(n_layer)
    n_layer = Activation('relu')(n_layer)
    n_layer = Conv2D(16 * 2**i, kernel_size = (3, 3), activation = 'linear')(n_layer)
    n_layer = BatchNormalization()(n_layer)
    n_layer = Activation('relu')(n_layer)
    n_layer = MaxPool2D((2, 2))(n_layer)
n_layer = Flatten()(n_layer)
n_layer = Dense(32, activation = 'linear')(n_layer)
n_layer = Dropout(0.5)(n_layer)
n_layer = BatchNormalization()(n_layer)
n_layer = Activation('relu')(n_layer)
feature_model = Model(inputs = [img_in], outputs = [n_layer], name = 'FeatureGenerationModel')
feature_model.summary()
Siamese Model
We apply the feature-generating model to both images and then combine the features to predict whether the images are similar or not. The model is designed to be very simple. The ultimate idea is that when a new image is taken, a feature vector can be calculated for it using the FeatureGenerationModel. The feature vectors for all existing images have been pre-calculated and stored in a database. The similarity head can then be applied with a few vector additions and multiplications to determine the most similar images; these operations can be implemented as a stored procedure or similar task inside the database itself, since they do not require an entire deep learning framework to run (a rough NumPy sketch of this serving path appears right after the model is compiled below).
from keras.layers import concatenate
img_a_in = Input(shape = x_train.shape[1:], name = 'ImageA_Input')
img_b_in = Input(shape = x_train.shape[1:], name = 'ImageB_Input')
img_a_feat = feature_model(img_a_in)
img_b_feat = feature_model(img_b_in)
combined_features = concatenate([img_a_feat, img_b_feat], name = 'merge_features')
combined_features = Dense(16, activation = 'linear')(combined_features)
combined_features = BatchNormalization()(combined_features)
combined_features = Activation('relu')(combined_features)
combined_features = Dense(4, activation = 'linear')(combined_features)
combined_features = BatchNormalization()(combined_features)
combined_features = Activation('relu')(combined_features)
combined_features = Dense(1, activation = 'sigmoid')(combined_features)
similarity_model = Model(inputs = [img_a_in, img_b_in], outputs = [combined_features], name = 'Similarity_Model')
similarity_model.summary()
# setup the optimization process
similarity_model.compile(optimizer='adam', loss = 'binary_crossentropy', metrics = ['mae'])
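As promised above, here is a sketch of how the trained similarity head could be evaluated with plain NumPy at serving time, assuming its Dense and BatchNormalization weights have been exported from Keras into plain arrays. The params dictionary and its key names (W1, b1, bn1, ...) are hypothetical, and the epsilon matches Keras's BatchNormalization default of 1e-3; this illustrates the deployment idea and nothing in the rest of the kernel depends on it.
def bn_inference(x, gamma, beta, mean, var, eps=1e-3):
    # Keras-style BatchNormalization at inference time, using exported moving statistics
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def score_pair(feat_a, feat_b, params):
    # mirrors the head: concat -> Dense(16) -> BN -> ReLU -> Dense(4) -> BN -> ReLU -> Dense(1) -> sigmoid
    x = np.concatenate([feat_a, feat_b])
    x = np.maximum(0, bn_inference(x @ params['W1'] + params['b1'], *params['bn1']))
    x = np.maximum(0, bn_inference(x @ params['W2'] + params['b2'], *params['bn2']))
    logit = x @ params['W3'] + params['b3']
    return 1.0 / (1.0 + np.exp(-logit))  # similarity score in [0, 1]
Since the query's feature vector is computed once and every stored image already has one, ranking a whole table reduces to a few small matrix multiplications per row.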
Visual Model Feedback
Here we visualize what the model does by taking a small sample of randomly selected A and B images: the first half come from the same category and the second half from different categories. We then show the actual similarity (1 for the same category, 0 for different categories) as well as the similarity the model predicts. The first run here uses a completely untrained network, so we do not expect meaningful results.
def show_model_output(nb_examples = 3):
    pv_a, pv_b, pv_sim = gen_random_batch(test_groups, nb_examples)
    pred_sim = similarity_model.predict([pv_a, pv_b])
    fig, m_axs = plt.subplots(2, pv_a.shape[0], figsize = (12, 6))
    for c_a, c_b, c_d, p_d, (ax1, ax2) in zip(pv_a, pv_b, pv_sim, pred_sim, m_axs.T):
        ax1.imshow(c_a[:, :, 0])
        ax1.set_title('Image A\n Actual: %3.0f%%' % (100 * c_d))
        ax1.axis('off')
        ax2.imshow(c_b[:, :, 0])
        ax2.set_title('Image B\n Predicted: %3.0f%%' % (100 * p_d))
        ax2.axis('off')
    return fig
# a completely untrained model
_ = show_model_output()
# make a generator out of the data
def siam_gen(in_groups, batch_size = 32):
    while True:
        # use the in_groups argument (the original body hard-coded train_groups)
        pv_a, pv_b, pv_sim = gen_random_batch(in_groups, batch_size // 2)
        yield [pv_a, pv_b], pv_sim
# we want a constant validation group to have a frame of reference for model performance
valid_a, valid_b, valid_sim = gen_random_batch(test_groups, 1024)
loss_history = similarity_model.fit_generator(siam_gen(train_groups),
                                              steps_per_epoch = 500,
                                              validation_data = ([valid_a, valid_b], valid_sim),
                                              epochs = 10,
                                              verbose = True)
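fit_generator returns a History object, so we can quickly check how training progressed with an optional plot of the recorded losses:
fig, ax1 = plt.subplots(1, 1, figsize = (8, 4))
ax1.plot(loss_history.history['loss'], label = 'Training loss')
ax1.plot(loss_history.history['val_loss'], label = 'Validation loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Binary cross-entropy')
ax1.legend()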
_ = show_model_output()
# score every test image against a fixed T-shirt image and a fixed ankle boot image
t_shirt_vec = np.stack([train_groups[0][0]] * x_test.shape[0], 0)
t_shirt_score = similarity_model.predict([t_shirt_vec, x_test], verbose = True, batch_size = 128)
ankle_boot_vec = np.stack([train_groups[-1][0]] * x_test.shape[0], 0)
ankle_boot_score = similarity_model.predict([ankle_boot_vec, x_test], verbose = True, batch_size = 128)
obj_categories = ['T-shirt/top','Trouser','Pullover','Dress',
'Coat','Sandal','Shirt','Sneaker','Bag','Ankle boot'
]
colors = plt.cm.rainbow(np.linspace(0, 1, 10))
plt.figure(figsize=(10, 10))
for c_group, (c_color, c_label) in enumerate(zip(colors, obj_categories)):
    plt.scatter(t_shirt_score[np.where(y_test == c_group), 0],
                ankle_boot_score[np.where(y_test == c_group), 0],
                marker='.',
                color=c_color,
                linewidth=1,
                alpha=0.8,
                label=c_label)
plt.xlabel('T-Shirt Dimension')
plt.ylabel('Ankle-Boot Dimension')
plt.title('T-Shirt and Ankle-Boot Dimension')
plt.legend(loc='best')
plt.savefig('tshirt-boot-dist.png')
plt.show(block=False)
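Since the stated goal is visual search, here is a small optional retrieval check: score one query image (simply the first test image, an arbitrary choice) against the entire test set and show the highest-scoring matches. The query itself should rank at or near the top.
query_vec = np.stack([x_test[0]] * x_test.shape[0], 0)
query_score = similarity_model.predict([query_vec, x_test], batch_size = 128)[:, 0]
top_idx = np.argsort(-query_score)[:6]  # most similar first
fig, m_axs = plt.subplots(1, 6, figsize = (12, 3))
for c_ax, c_idx in zip(m_axs, top_idx):
    c_ax.imshow(x_test[c_idx, :, :, 0])
    c_ax.set_title('%2.0f%%' % (100 * query_score[c_idx]))
    c_ax.axis('off')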
x_test_features = feature_model.predict(x_test, verbose = True, batch_size=128)
%%time
from sklearn.manifold import TSNE
tsne_obj = TSNE(n_components=2,
                init='pca',
                random_state=101,
                method='barnes_hut',
                n_iter=500,
                verbose=2)
tsne_features = tsne_obj.fit_transform(x_test_features)
obj_categories = ['T-shirt/top','Trouser','Pullover','Dress',
'Coat','Sandal','Shirt','Sneaker','Bag','Ankle boot'
]
colors = plt.cm.rainbow(np.linspace(0, 1, 10))
plt.figure(figsize=(10, 10))
for c_group, (c_color, c_label) in enumerate(zip(colors, obj_categories)):
    plt.scatter(tsne_features[np.where(y_test == c_group), 0],
                tsne_features[np.where(y_test == c_group), 1],
                marker='o',
                color=c_color,
                linewidth=1,
                alpha=0.8,
                label=c_label)
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('t-SNE on Testing Samples')
plt.legend(loc='best')
plt.savefig('clothes-dist.png')
plt.show(block=False)
feature_model.save('fashion_feature_model.h5')
similarity_model.save('fashion_similarity_model.h5')
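Both files contain only standard Keras layers, so they can be reloaded later without any custom objects:
from keras.models import load_model
reloaded_feature_model = load_model('fashion_feature_model.h5')
reloaded_similarity_model = load_model('fashion_similarity_model.h5')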