State Farm Distracted Driver Detection Proposal


Many drivers today do not keep their attention on driving. They do things unrelated to driving, such as drinking, sending messages, or playing games. According to related reports, about 20% of accidents are caused by distracted drivers, and about 42,500 people are injured and 3,000 people die as a result every year. Our group therefore decided to work on the project ‘State Farm Distracted Driver Detection’, which predicts what the driver in a car is doing. We chose it because it addresses a real-world problem.

This project comes from a Kaggle competition, and Kaggle also provides the dataset. The dataset consists of images taken inside a car, each showing a driver doing something (texting, eating, talking on the phone, applying makeup, reaching behind, etc.). Our goal is to predict, for every picture, the likelihood of each activity.

Keras already provides several models, such as VGG16, VGG19, ResNet50, Inception, and Xception. We will briefly introduce some of these models and try to fuse them to get a better result in this project.

Our results will be evaluated using the multi-class logarithmic loss. Each image is labeled with one true class, so for each image we will submit a set of predicted probabilities (one for every class). We will compare our score with the others on Kaggle to judge whether our method is suitable.
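As a rough illustration of the metric, the sketch below computes the multi-class logarithmic loss in plain Python. The clipping constant `eps` and the row renormalisation are assumptions about how such a score is usually computed, and the example labels and probabilities are made up.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each image."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip to avoid log(0), then renormalise so the row sums to 1 (assumed convention).
        clipped = [min(max(p, eps), 1 - eps) for p in probs]
        row_sum = sum(clipped)
        total += -math.log(clipped[label] / row_sum)
    return total / len(true_labels)

# Two hypothetical images with 3 classes each (the competition itself has 10 classes).
y_true = [0, 2]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
loss = multiclass_log_loss(y_true, y_pred)
print(round(loss, 4))
```

A model that predicts a uniform 0.1 for each of the 10 classes would score ln(10), about 2.30, so any useful model should score well below that.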


Out of concern for safety in public places, group behavior recognition methods based on image processing have been widely used. Turning to driving safety, a well-known bus accident that occurred in * on 10.02.2018 took 16 lives. Most traffic accidents are caused by human factors, such as fatigued driving and drivers who pay no attention to traffic rules. To avoid accidents caused by human factors, it is necessary to build a system that detects people’s behavior while they are driving. [1]


Distracted driver detection is the topic of a Kaggle competition, and we chose it as our team project. The task is to detect the driver’s gesture in each photo. There are ten different gestures, such as safe driving, texting, drinking, grooming (hair and makeup), and making phone calls. We treat it as an image processing and classification problem.

Kaggle is a platform that provides machine learning competitions, data set management, and code sharing for developers and data scientists. The distracted driver detection competition was last run about two years ago.

Background to the research

Kaggle provides the data set, so we can use this existing resource for training, which saves us a lot of time. Although the given data set limits the methods we can build, we still surveyed the state of research before making up our minds. Since driver detection can be approached in several different ways, we expect that other technologies and systems may inspire the design of our own project. The following designs use different ideas and methods.

In the article ‘Real-Time Detection of Driver Cognitive Distraction Using Support Vector Machines’, the authors use SVMs to develop a real-time approach for detecting cognitive distraction from drivers’ eye movements and driving performance data. [2] They collected the data and then used it to train and test both SVM and logistic regression models. The limitation is obvious: the eye tracker may lose accuracy when the car is in a changing environment, such as under complex lighting conditions. The authors also described the detection delay, which may be the most difficult problem for practical application, [3] but the paper did not quantify the lags precisely. Although the eye-tracker system still has some limitations, it is an appealing way to monitor the driver and reduce accidents, with an average accuracy of 81.1%.

In ‘Systems approach to the design of locomotive fatigue management technologies’, Aboukhalil’s main idea is to combine information from an infrared eyelid monitor and a generic Train Sentry class activity monitor, to isolate the common drowsiness component and obtain an improved estimate of the operator’s state. Instead of image processing, the work uses Signal Detection Theory (SDT) combined with a stochastic framework. The method is designed for train drivers, who differ from car drivers in many ways, but it still shows a different approach to detection. [4]

Benchmark Model

Since the introduction of AlexNet in 2012, fields such as image classification and object detection have been dominated by convolutional neural networks (CNNs). Since then, people have kept designing new deep network models to get better training results. In general, improvements in network architecture (e.g., from VGG to ResNet) bring further performance gains to many different computer vision domains. These CNN models share a common problem: they are computationally intensive. AlexNet contains about 60M parameters, and the later VGGNet has more than twice as many. In 2014, Google proposed GoogLeNet with only about 5M parameters, while matching or exceeding AlexNet’s accuracy. Although some computational techniques can reduce the amount of computation, they implicitly increase the complexity of the model. Models with few parameters have great advantages with very large data sets or in memory-constrained scenarios.

The question, therefore, is whether there is a way to keep the network structure sparse while still taking advantage of the high computational performance of dense matrices. A large body of literature has shown that sparse matrices can be clustered into denser sub-matrices to improve computational performance. First, based on this idea, a structure called Inception was proposed to achieve this goal. Second, Xception is a further improvement on Inception v3, also proposed by Google. It mainly replaces the convolution operations in the original Inception v3 with depthwise separable convolutions.


Inception’s earliest paper focused on a new building block for deep networks, now called the “Inception module”. This module grew out of two insights. The first insight is: why not let the model choose?

For each layer, the Inception module performs a 5×5 convolution, a 3×3 convolution, and a max pooling in parallel and concatenates the results. This raises an obvious problem: the computational cost increases greatly. The second insight addresses this bottleneck: Inception’s authors used 1×1 convolutions to “filter” the depth of the outputs. A 1×1 convolution looks at only one spatial position at a time, but across all channels, so it can combine the information at that position and compress it to a lower dimension.
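To make the saving from the 1×1 “filter” concrete, the following sketch counts weights for a 5×5 convolution applied directly versus one applied after a 1×1 reduction. The channel counts (192, 16, 32) are illustrative values chosen for this example, and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

# Illustrative channel counts, assumed for this sketch.
c_in, c_reduced, c_out = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_weights(5, c_in, c_out)

# 1x1 convolution first compresses the depth, then the 5x5 runs on fewer channels.
bottleneck = conv_weights(1, c_in, c_reduced) + conv_weights(5, c_reduced, c_out)

print(direct, bottleneck)
```

With these numbers the direct branch needs 153,600 weights and the reduced branch 15,872, which is roughly a tenfold saving for the same output depth.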

The first version of Inception was GoogLeNet, the 22-layer network mentioned earlier that won the ILSVRC 2014 competition. One year later, the researchers developed Inception v2 and v3 in a second paper and made several improvements to the original version, most notably factorizing larger convolutions into consecutive smaller ones that are easier to learn. For example, in v3 the 5×5 convolution is replaced by two consecutive 3×3 convolutions. Inception quickly became a defining model architecture. The latest version, Inception v4, even adds residual connections to each module, creating an Inception-ResNet hybrid. More importantly, Inception demonstrates the power of a well-designed “network-in-network” architecture, which takes the representational capability of neural networks to the next level. [5]
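The replacement of one 5×5 convolution by two 3×3 convolutions can be checked with a little arithmetic. The sketch below assumes stride 1, the same channel count in every layer (64 is an arbitrary choice), and ignores biases.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

def stacked_weights(kernel_sizes, channels):
    """Total weights when every layer maps `channels` to `channels` (biases ignored)."""
    return sum(k * k * channels * channels for k in kernel_sizes)

channels = 64  # assumed channel count for the comparison

single_5x5 = stacked_weights([5], channels)
two_3x3 = stacked_weights([3, 3], channels)

print(stacked_receptive_field([5]), stacked_receptive_field([3, 3]))
print(single_5x5, two_3x3)
```

Both variants see a 5×5 region of the input, but the stacked version uses 18 weights per channel pair instead of 25, a reduction of 28%.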


Xception stands for “extreme Inception”. Like the architectures before it, it reshapes the way we look at neural networks, especially convolutional networks. As its name suggests, it pushes the principles of Inception to the extreme. Its hypothesis is: “Cross-channel correlations and spatial correlations are completely separable, and it is best not to map them jointly.”

In a conventional convolution layer, the filter considers both a spatial dimension (for example, a 2×2 block of pixels) and a cross-channel or “depth” dimension (for example, a stack of 4 channels) at the same time. At the input layer, this corresponds to a convolution filter that looks at a 2×2 pixel block across all three RGB channels.

This raises a question: why do we need to consider image regions and channels at the same time? Inception starts to separate the two slightly. It uses 1×1 convolutions to project the original input into several separate, smaller input spaces, and in each of these input spaces it uses a different type of filter to transform the smaller 3D blocks of data. Xception goes one step further. Instead of splitting the input data into several compressed blocks, it maps the spatial correlations separately for each channel, and then performs a 1×1 pointwise convolution to capture the cross-channel correlations.
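The separation of spatial and cross-channel mapping can be sketched with NumPy. This is a minimal illustration of a depthwise separable convolution (per-channel spatial filter followed by a 1×1 pointwise step), assuming stride 1 and no padding; it is not the Xception implementation itself, and the array sizes are arbitrary.

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_kernels, pointwise_weights):
    """x: (H, W, C_in); depthwise_kernels: (k, k, C_in); pointwise_weights: (C_in, C_out)."""
    h, w, c_in = x.shape
    k = depthwise_kernels.shape[0]
    out_h, out_w = h - k + 1, w - k + 1
    spatial = np.zeros((out_h, out_w, c_in))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]  # (k, k, C_in)
            # Each channel is filtered with its own k x k kernel; channels are not mixed.
            spatial[i, j, :] = np.sum(patch * depthwise_kernels, axis=(0, 1))
    # The 1x1 pointwise step mixes channels at every spatial position.
    return spatial @ pointwise_weights

k, c_in, c_out = 3, 3, 4  # arbitrary sizes for the example
x = np.ones((5, 5, c_in))
out = depthwise_separable_conv(x, np.ones((k, k, c_in)), np.ones((c_in, c_out)))

regular_weights = k * k * c_in * c_out
separable_weights = k * k * c_in + c_in * c_out
print(out.shape, regular_weights, separable_weights)
```

For these sizes a regular 3×3 convolution would need 108 weights, while the separable version needs 39, and the gap widens as the channel counts grow.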

To sum up, on the ImageNet dataset Xception performs slightly better than Inception v3, and it performs much better on a larger image classification dataset with 17,000 classes. Most importantly, it has about the same number of model parameters as Inception v3, which indicates that it uses its parameters more efficiently.

VGG16 & VGG19

VGGNet is a deep convolutional neural network developed by researchers from the Visual Geometry Group at Oxford University and Google DeepMind. Today the most commonly used variants are VGG-16 and VGG-19, which are introduced below. [6]

The 16 in VGG-16 means that the network has 16 weight layers: 13 convolution layers and 3 fully connected layers. Most of the parameters are in the 3 fully connected layers, and the total is about 138 million.
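The figure of about 138 million can be reproduced by counting weights and biases layer by layer. The sketch below assumes the standard ImageNet configuration (224×224 RGB input, 3×3 convolutions, 1000 output classes); for this project the last layer would have 10 outputs instead.

```python
# (number of 3x3 conv layers, output channels) for each of the five blocks.
VGG16_CONV_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FC_SIZES = [4096, 4096, 1000]  # 1000 = ImageNet classes (assumed configuration)

def vgg16_parameter_count(input_size=224, input_channels=3):
    """Return (conv parameters, fully connected parameters), biases included."""
    conv_params = 0
    channels = input_channels
    size = input_size
    for num_layers, out_channels in VGG16_CONV_BLOCKS:
        for _ in range(num_layers):
            conv_params += (3 * 3 * channels + 1) * out_channels
            channels = out_channels
        size //= 2  # 2x2 max pooling with stride 2 after each block
    fc_params = 0
    features = size * size * channels  # flattened 7 x 7 x 512 feature map
    for units in FC_SIZES:
        fc_params += (features + 1) * units
        features = units
    return conv_params, fc_params

conv_params, fc_params = vgg16_parameter_count()
print(conv_params, fc_params, conv_params + fc_params)
```

Under these assumptions the convolution layers hold about 14.7 million parameters and the fully connected layers about 123.6 million, so close to 90% of the model sits in the last three layers.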

With few hyper-parameters, the VGG-16 structure is very regular and aims at building a simple network. It uses only small 3×3 convolution kernels and 2×2 max pooling layers. Several convolution layers are followed by a pooling layer that reduces the image size, and the network is deepened step by step to improve performance.

  • Convolution layers: CONV = 3×3 filters, s = 1, padding = same
  • Pooling layers: MAX_POOL = 2×2, s = 2

VGG-16 has advantages and disadvantages. On the one hand, it simplifies the structure of the convolutional neural network. On the other hand, the number of parameters to train is very large, so training takes a long time and occupies more memory.

As the network gets deeper, the width and height of the feature maps decrease in a regular way: each pooling layer halves the width and height, while the number of channels doubles from one block to the next until it reaches 512.
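The regular shrinking of the feature maps can be traced with a few lines of Python. The sketch assumes a 224×224 input and the usual VGG-16 block widths; "same" padding means only the pooling layers change the spatial size.

```python
def vgg16_feature_map_shapes(input_size=224):
    """Shape (height, width, channels) after the pooling layer of each block."""
    block_channels = [64, 128, 256, 512, 512]  # usual VGG-16 block widths
    shapes = []
    size = input_size
    for channels in block_channels:
        size //= 2  # 2x2 max pooling with stride 2 halves width and height
        shapes.append((size, size, channels))
    return shapes

shapes = vgg16_feature_map_shapes()
for shape in shapes:
    print(shape)
```

The last entry, 7×7×512, is the feature map that is flattened and passed to the fully connected layers.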

Besides VGG-16, VGG-19 is also often used. The two differ mainly in the number of convolution layers, but their performance is almost the same, so we will mainly use VGG-16.

  • VGG16 (figure) [7]

  • VGG19 (figure) [8]

Luckily, Keras already provides these models in Keras Applications:

  • Xception
  • VGG16
  • VGG19
  • ResNet50
  • InceptionV3
  • InceptionResNetV2
  • MobileNet
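As a hedged sketch of how one of the listed models might be adapted to the 10 driver classes, the code below builds a backbone from Keras Applications and adds a softmax layer. The helper name `build_classifier`, the use of `tensorflow.keras` rather than standalone Keras, and the global-average-pooling head are assumptions made for this example, not the project's final design.

```python
# Default input sizes of the corresponding Keras Applications models.
INPUT_SIZES = {
    "Xception": 299,
    "VGG16": 224,
    "VGG19": 224,
    "ResNet50": 224,
    "InceptionV3": 299,
    "InceptionResNetV2": 299,
    "MobileNet": 224,
}
NUM_CLASSES = 10  # c0 .. c9

def build_classifier(model_name, weights=None):
    """Hypothetical helper: backbone without its top layers plus a 10-way softmax."""
    # Imported lazily so the table above can be used without TensorFlow installed.
    from tensorflow.keras import applications, layers, models

    size = INPUT_SIZES[model_name]
    backbone_class = getattr(applications, model_name)
    base = backbone_class(
        include_top=False,      # drop the ImageNet classification layers
        weights=weights,        # None = random init, "imagenet" = pretrained
        input_shape=(size, size, 3),
        pooling="avg",          # global average pooling on the last feature map
    )
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    return models.Model(inputs=base.input, outputs=outputs)
```

Calling `build_classifier("VGG16", weights="imagenet")` would download the pretrained weights on first use; the returned model can then be compiled with a categorical cross-entropy loss, which matches the log loss used for evaluation.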

Project Design

Project Confirmation and solutions

After several discussions, the team members decided to select driver status monitoring from the Kaggle competition as the topic of this group project. We also decided to use deep learning to detect the driver’s state as the project requires: given a color picture of the driver taken while driving, the model should accurately output the driver’s driving state.

Finding Information

We will collect information on the learning models that may be used in the project and compare the differences between them to select a training model suited to the project’s objectives. We will also study the projects that others have completed in the Kaggle competition and learn from their experience in order to achieve a better outcome.

Datasets and Inputs

The data set comes from the Kaggle competition. Data set processing mainly includes gathering the data source, feature extraction, feature dimension reduction, and null value handling. We will preprocess the training set in advance, taking into account factors such as image size, angle, and the driver’s position and state. For example, we will randomly transform the images (rotation, scaling) to reduce over-fitting, and randomize the order of the data so that the order does not affect learning and the training steps are independent of each other. In addition, the data should be weighted so that the distraction states are roughly evenly represented, with no state being especially frequent and another especially rare.
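Two of the steps above, randomizing the sample order and weighting the classes, can be sketched in plain Python. The function names and the toy label list are hypothetical, and the inverse-frequency weighting shown here is only one common way to balance classes.

```python
import random
from collections import Counter

def shuffle_samples(samples, seed=0):
    """Return a shuffled copy so that the file order does not influence training."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (rare classes get larger weights)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels, made up for the example: c0 is twice as frequent as c1 and c2.
labels = ["c0"] * 6 + ["c1"] * 3 + ["c2"] * 3
shuffled_labels = shuffle_samples(labels)
weights = balanced_class_weights(labels)
print(weights)
```

When the classes are already close to evenly sized, the resulting weights all stay near 1 and have little effect.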

Looking into the dataset, the 10 classes to predict are:

  • c0: safe driving
  • c1: texting - right
  • c2: talking on the phone - right
  • c3: texting - left
  • c4: talking on the phone - left
  • c5: operating the radio
  • c6: drinking
  • c7: reaching behind
  • c8: hair and makeup
  • c9: talking to passenger
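For later steps it can be handy to have the class list as a lookup table. The sketch below turns a 10-element probability vector into the most likely class; the helper name `most_likely_class` and the example probabilities are made up for illustration.

```python
CLASS_NAMES = {
    "c0": "safe driving",
    "c1": "texting - right",
    "c2": "talking on the phone - right",
    "c3": "texting - left",
    "c4": "talking on the phone - left",
    "c5": "operating the radio",
    "c6": "drinking",
    "c7": "reaching behind",
    "c8": "hair and makeup",
    "c9": "talking to passenger",
}

def most_likely_class(probabilities):
    """Return (class id, description) for one 10-element probability vector."""
    index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    class_id = "c{}".format(index)
    return class_id, CLASS_NAMES[class_id]

# Hypothetical model output for a single image.
example = [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.55, 0.05, 0.05, 0.05]
prediction = most_likely_class(example)
print(prediction)
```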


Each category has about 2,000 to 2,500 pictures, so the class sizes do not differ much.

Model building and training

Environment preference

  • python 3.5
  • tensorflow-gpu
  • keras
  • opencv3

Next we establish the training models. The project plans to adopt models such as VGG, ResNet50, Inception, and Xception. The structure of VGG is very regular; it uses small convolution kernels to increase the depth of the network and improve its performance. Inception lets the network itself choose among filters of several sizes, which widens the network, adapts it to multiple scales, and reduces the amount of computation to some extent. Xception builds on Inception v3 and uses depthwise separable convolutions to improve network efficiency, making it more powerful on large data sets and making better use of hardware resources. The project will use the training set to train 3 models, analyze the results, and then tune the models or fuse several of them to improve accuracy. Through repeated training, the final model parameters are determined so that the predictions come closer to the true labels.
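One simple way to fuse several models is to average their predicted probabilities for each image. The sketch below shows this under the assumption that every model outputs one probability row per image in the same order; averaging is only one possible fusion strategy, and the three example outputs are invented.

```python
def average_predictions(model_outputs):
    """model_outputs: one list of probability rows per model, all in the same image order."""
    n_models = len(model_outputs)
    fused = []
    # rows holds the predictions of every model for one image.
    for rows in zip(*model_outputs):
        # zip(*rows) groups the probabilities of one class across all models.
        fused.append([sum(values) / n_models for values in zip(*rows)])
    return fused

# Invented outputs of three models for a single image and 3 classes.
model_a = [[0.6, 0.3, 0.1]]
model_b = [[0.5, 0.4, 0.1]]
model_c = [[0.7, 0.2, 0.1]]
fused = average_predictions([model_a, model_b, model_c])
print(fused)
```

Because each input row sums to 1, the averaged row does as well, so the fused output can be scored with the log loss directly.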

CAM Visualization

In practice, accuracy is only one side of a model; we also want to know the reason for its results. Therefore, we will use CAM (Class Activation Mapping) to visualize the parts of the image the neural network attends to when identifying the driver’s state, such as the driver’s face and hands.
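The core of CAM is a weighted sum of the last convolutional feature maps, using the weights that connect each map to the chosen class. The NumPy sketch below assumes a network whose last convolution is followed by global average pooling and a single dense layer; the function name, array shapes, and toy values are illustrative only.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights, class_index):
    """feature_maps: (H, W, K) from the last conv layer;
    class_weights: (K, num_classes) dense weights after global average pooling."""
    # Weighted sum of the K feature maps for the requested class, shape (H, W).
    cam = feature_maps @ class_weights[:, class_index]
    # Rescale to [0, 1] so the map can be drawn as a heat map over the image.
    cam = cam - cam.min()
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    return cam

# Toy example: 2x2 feature maps with K = 2 channels and 2 classes.
features = np.zeros((2, 2, 2))
features[0, 0, 0] = 1.0  # channel 0 fires in the top-left corner
features[1, 1, 1] = 1.0  # channel 1 fires in the bottom-right corner
dense_weights = np.array([[1.0, 0.0], [0.0, 1.0]])

cam_class0 = class_activation_map(features, dense_weights, 0)
print(cam_class0)
```

In a real pipeline the low-resolution map would be resized to the input image size and overlaid on the picture to show which regions drove the prediction.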

Evaluation Metrics

The project uses LogLoss, the evaluation metric of the Kaggle competition, to evaluate the trained model on the test set and analyze the results. Because the model has never seen the test data, its performance on them reflects, to some extent, how it would perform in the real world.


Project Summary

The team members will reflect on the problems that arise during the project and propose improvements, then summarize the experience gained and apply it to future projects.

  1. Gao Xuan, Liu Yongkui, Wang Dafeng. A survey of crowd behavior recognition methods based on image processing[J]. Computer & Digital Engineering, 2016, 44(8): 1557-1562. (in Chinese)

  2. Liang Y, Reyes M L, Lee J D. Real-time detection of driver cognitive distraction using support vector machines[J]. IEEE Transactions on Intelligent Transportation Systems, 2007, 8(2): 340-350.

  3. Wang Ying. Fatigue driving monitoring technology for locomotive drivers based on facial expression and posture[D]. Beijing Jiaotong University, 2012. (in Chinese)

  4. Aboukhalil A. Systems approach to the design of locomotive fatigue management technologies[D]. Massachusetts Institute of Technology, 2006.

  5. Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2818-2826.

  6. Yin Baocai, Wang Wentong, Wang Lichun. A review of deep learning research[J]. Journal of Beijing University of Technology, 2015, 41(1): 48-59. (in Chinese)

  7. VGG16 picture source: Building powerful image classification models using very little data

  8. VGG19 picture source: Applied Deep Learning 11/03 Convolutional Neural Networks