Hands-On Machine Learning with Scikit-Learn & TensorFlow Exercise Q&A Chapter05
Q1. What is the fundamental idea behind Support Vector Machines?
A1: The fundamental idea behind SVMs is to fit the widest possible "street" between the classes: the SVM maximizes the margin between the decision boundary and the training instances while separating the two classes as well as possible. For soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible margin (i.e., allowing a few margin violations).
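As a small illustration (not part of the book's solution): in scikit-learn's LinearSVC, the hyperparameter C controls this compromise, with a small C favoring a wider street at the cost of more margin violations:
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC
X, y = make_blobs(n_samples=100, centers=2, random_state=42)
soft_clf = LinearSVC(C=0.01, loss="hinge", max_iter=10000, random_state=42).fit(X, y)  # wide margin, more violations
hard_clf = LinearSVC(C=1000, loss="hinge", max_iter=10000, random_state=42).fit(X, y)  # narrow margin, few violations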
Q2. What is a support vector?
A2: A support vector is any instance located on the "street", including its border, which means the support vectors are the instances closest to the hyperplane. The decision boundary is entirely determined by the support vectors; any instance that is not a support vector has no influence on it. Computing the predictions only involves the support vectors, not the whole training set.
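After fitting a kernelized SVC, scikit-learn exposes them directly through the support_vectors_ attribute (a quick sketch on toy data):
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
X, y = make_blobs(n_samples=100, centers=2, random_state=42)
svc = SVC(kernel="linear", C=100).fit(X, y)
print(svc.support_vectors_)  # the instances on the street or on its border
print(svc.support_)          # their indices in the training set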
Q3. Why is it important to scale the inputs when using SVMs?
A3: SVMs try to fit the largest possible margin between the two classes, and the predictions are entirely determined by the support vectors, so if the training set is not scaled, the SVM will tend to neglect features with small scales: the margin is dominated by the features with the largest values.
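In practice, the simplest safeguard is to chain a StandardScaler with the SVM in a Pipeline, so the scaler is fitted on the training data only (a minimal sketch):
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
X, y = make_blobs(n_samples=100, centers=2, random_state=42)
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(C=1, loss="hinge", random_state=42))
scaled_svm.fit(X, y)  # scaling parameters are learned from X, then applied before the SVM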
Q4. Can an SVM classifier output a confidence score when it classifies an instance? What about a probability?
A4: An SVM classifier can output the distance between a test instance and the decision boundary (via the decision_function() method), and you can use this as a confidence score. However, it cannot directly output class probabilities; if you set probability=True when creating an SVC, it will use Logistic Regression on the SVM's scores (calibrated with an extra cross-validation on the training data) to estimate probabilities, which adds the predict_proba() method.
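For example (a quick sketch; note that probability=True makes training noticeably slower because of the extra cross-validation):
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
X, y = make_blobs(n_samples=100, centers=2, random_state=42)
svc = SVC(kernel="linear", probability=True, random_state=42).fit(X, y)
print(svc.decision_function(X[:3]))  # signed distances to the decision boundary
print(svc.predict_proba(X[:3]))      # calibrated class probabilities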
Q5. Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?
A5: This question applies only to linear SVMs, since kernelized SVMs can only use the dual form. In this problem there are millions of training instances but only hundreds of features. The computational complexity of the primal form is roughly proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m^2 and m^3, which is far too big for millions of training instances. So we should use the primal form.
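In scikit-learn terms, this means preferring LinearSVC (or SGDClassifier) over SVC for such a dataset; LinearSVC even exposes a dual parameter, and dual=False is the recommended setting when the number of instances exceeds the number of features (a sketch; X_big and y_big stand for a hypothetical large training set):
from sklearn.svm import LinearSVC
# dual=False solves the primal problem (requires the default squared hinge loss)
big_svm = LinearSVC(dual=False, C=1, random_state=42)
# big_svm.fit(X_big, y_big)  # X_big, y_big: hypothetical training set with millions of instances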
Q6. Say you trained an SVM classifier with an RBF kernel. It seems to underfit the training set: should you increase or decrease gamma? What about C?
A6: If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization: we need to increase gamma or C (or both).
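For instance (a minimal sketch; the exact values are illustrative, not tuned):
from sklearn.svm import SVC
underfit_clf = SVC(kernel="rbf", gamma=0.1, C=0.1)   # strong regularization: likely underfits
flexible_clf = SVC(kernel="rbf", gamma=5.0, C=10.0)  # larger gamma and C: more flexible fit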
Q7. How should you set the QP parameters (H, f, A, and b) to solve the soft margin linear SVM classifier problem using an off-the-shelf QP solver?
A7: Let H, f, A, and b be the QP parameters for the hard margin problem. The QP parameters for the soft margin problem have m additional parameters (the slack variables zeta) and m additional constraints.
Thus, H' is equal to H with m columns of 0s on the right and m rows of 0s at the bottom; f' is equal to f with m additional elements, all equal to the value of the hyperparameter C; A' is equal to A with an extra m*m identity matrix appended to the right and m extra rows at the bottom for the slack non-negativity constraints; and b' is equal to b with m additional elements, all equal to 0.
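A minimal numpy sketch of this padding (assuming H, f, A, and b were already built for the hard margin problem under the convention: minimize 1/2 p'Hp + f'p subject to Ap <= b, so the identity blocks enter with a minus sign):
import numpy as np
def soft_margin_qp_params(H, f, A, b, C):
    # Pad the hard margin QP parameters with m slack variables zeta
    m = A.shape[0]                                   # number of training instances
    n1 = H.shape[0]                                  # n + 1 (feature weights plus bias)
    H2 = np.zeros((n1 + m, n1 + m))
    H2[:n1, :n1] = H                                 # slacks do not appear in the quadratic term
    f2 = np.concatenate([f, C * np.ones(m)])         # each slack is penalized by C
    A2 = np.block([
        [A, -np.eye(m)],                             # margin constraints, relaxed by zeta_i
        [np.zeros((m, n1)), -np.eye(m)],             # -zeta_i <= 0, i.e., zeta_i >= 0
    ])
    b2 = np.concatenate([b, np.zeros(m)])
    return H2, f2, A2, b2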
Q8. Train a LinearSVC on a linearly separable dataset. Then train an SVC and a SGDClassifier on the same dataset. See if you can get them to produce roughly the same model.
A8:
First, get the data. We use the Iris dataset for this test:
from sklearn import datasets
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]
setosa_or_versicolor = (y == 0) | (y == 1)
X = X[setosa_or_versicolor]
y = y[setosa_or_versicolor]
Then train the three classifiers on the data and output the models (never forget to scale the data):
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
C = 5
alpha = 1 / (C * len(X))
lin_clf = LinearSVC(loss="hinge", C=C, random_state=42)
svm_clf = SVC(kernel="linear", C=C)
sgd_clf = SGDClassifier(loss="hinge", learning_rate="constant", eta0=0.001, alpha=alpha,
                        max_iter=100000, tol=None, random_state=42)  # tol=None: run all epochs
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lin_clf.fit(X_scaled, y)
svm_clf.fit(X_scaled, y)
sgd_clf.fit(X_scaled, y)
print("LinearSVC: ", lin_clf.intercept_, lin_clf.coef_)
print("SVC: ", svm_clf.intercept_, svm_clf.coef_)
print("SGDClassifier(alpha={:.5f}):".format(sgd_clf.alpha), sgd_clf.intercept_, sgd_clf.coef_)
Finally, plot the decision boundaries of the three models:
import matplotlib.pyplot as plt
# Compute the slope and bias of each decision boundary
w1 = -lin_clf.coef_[0, 0]/lin_clf.coef_[0, 1]
b1 = -lin_clf.intercept_[0]/lin_clf.coef_[0, 1]
w2 = -svm_clf.coef_[0, 0]/svm_clf.coef_[0, 1]
b2 = -svm_clf.intercept_[0]/svm_clf.coef_[0, 1]
w3 = -sgd_clf.coef_[0, 0]/sgd_clf.coef_[0, 1]
b3 = -sgd_clf.intercept_[0]/sgd_clf.coef_[0, 1]
# Transform the decision boundary lines back to the original scale
line1 = scaler.inverse_transform([[-10, -10 * w1 + b1], [10, 10 * w1 + b1]])
line2 = scaler.inverse_transform([[-10, -10 * w2 + b2], [10, 10 * w2 + b2]])
line3 = scaler.inverse_transform([[-10, -10 * w3 + b3], [10, 10 * w3 + b3]])
# Plot all three decision boundaries
plt.figure(figsize=(11, 4))
plt.plot(line1[:, 0], line1[:, 1], "k:", label="LinearSVC")
plt.plot(line2[:, 0], line2[:, 1], "b--", linewidth=2, label="SVC")
plt.plot(line3[:, 0], line3[:, 1], "r-", label="SGDClassifier")
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs") # label="Iris-Versicolor"
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo") # label="Iris-Setosa"
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper center", fontsize=14)
plt.axis([0, 5.5, 0, 2])
plt.show()
As the plot shows, the three decision boundaries are almost the same!
Q9. Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to use one-versus-all to classify all 10 digits. You may want to tune the hyperparameters using small validation sets to speed up the process. What accuracy can you reach?
A9:
First, get the data and split it into a training set and a test set:
import numpy as np
try:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)  # as_frame=False keeps numpy arrays
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
X = mnist["data"]
y = mnist["target"]
X_train = X[:60000]
y_train = y[:60000]
X_test = X[60000:]
y_test = y[60000:]
Shuffle the training set:
np.random.seed(42)
rnd_idx = np.random.permutation(60000)
X_train = X_train[rnd_idx]
y_train = y_train[rnd_idx]
Don't forget to scale the data:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float32))
X_test_scaled = scaler.transform(X_test.astype(np.float32))
Train the model:
lin_clf = LinearSVC(random_state=42)
lin_clf.fit(X_train_scaled, y_train)
Then check the accuracy score on the training set:
from sklearn.metrics import accuracy_score
y_pred = lin_clf.predict(X_train_scaled)
accuracy_score(y_train, y_pred)
92.3% accuracy; that's pretty good for a linear model on MNIST!
As the exercise suggests, we can use a kernelized SVC with a one-versus-rest decision function instead (training on the first 10,000 instances to keep it fast):
svm_clf = SVC(decision_function_shape="ovr", gamma="auto")
svm_clf.fit(X_train_scaled[:10000], y_train[:10000])
y_pred = svm_clf.predict(X_train_scaled)
accuracy_score(y_train, y_pred)
We can then fine-tune the hyperparameters (mainly C and gamma) with a grid search or randomized search on a small validation set to speed up the process; tuning on a very small subset can overfit that subset, so retrain the best model on the full training set and confirm the score on the test set.
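A minimal sketch of such a search on a 1,000-instance subset (the distributions and subset size are illustrative choices):
from scipy.stats import reciprocal, uniform
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(svm_clf, param_distributions, n_iter=10, cv=3, random_state=42)
rnd_search_cv.fit(X_train_scaled[:1000], y_train[:1000])
rnd_search_cv.best_estimator_.fit(X_train_scaled, y_train)  # retrain the best model on the full set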
Q10. Train an SVM regression on the California housing dataset.
A10:
Get the data, split it into a training set and a test set, scale the data, train a model, test it, and check the model's RMSE:
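A minimal sketch of those steps with a LinearSVR (hyperparameters left at their defaults; the exact RMSE will depend on them):
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lin_svr = LinearSVR(random_state=42)
lin_svr.fit(X_train_scaled, y_train)
y_pred = lin_svr.predict(X_test_scaled)
print(np.sqrt(mean_squared_error(y_test, y_pred)))  # RMSE, in units of $100,000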
Then we can try other models instead of LinearSVR for this task, such as an SVR with an RBF kernel:
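For instance, an SVR with an RBF kernel, tuned with a small randomized search (a sketch; the distributions are illustrative and this reuses the scaled data from above):
import numpy as np
from scipy.stats import reciprocal, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
param_distributions = {"gamma": reciprocal(0.001, 0.1), "C": uniform(1, 10)}
rnd_search_cv = RandomizedSearchCV(SVR(), param_distributions, n_iter=10, cv=3, random_state=42)
rnd_search_cv.fit(X_train_scaled, y_train)
y_pred = rnd_search_cv.best_estimator_.predict(X_test_scaled)
print(np.sqrt(mean_squared_error(y_test, y_pred)))  # RMSE of the tuned kernelized model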