Facial recognition has become an increasingly common technology.

Today, smartphones use facial recognition for access control, and animated movies use facial recognition software to bring realistic human movement and expression to life. Police surveillance cameras use it to identify people with outstanding arrest warrants, and retail stores use it for targeted marketing campaigns. And of course, celebrity look-alike apps and Facebook’s auto-tagger use facial recognition to tag faces.

Not all facial recognition libraries are equal in accuracy and performance, and most state-of-the-art systems are proprietary black boxes.

OpenFace is an open-source library that rivals the performance and accuracy of proprietary models. It was created with mobile performance in mind, so let’s look at some of the internals that make this library fast and accurate, and work through some use cases where you might want to use it in your own projects.

OpenFace Overview

OpenFace is a deep learning facial recognition model developed by Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan at Carnegie Mellon University. It is based on Google’s FaceNet architecture and is implemented in Python and Torch, so it can run on CPUs or GPUs.

The following overview shows the workflow for a single input image of Sylvester Stallone from the publicly available LFW dataset.

  1. Detect faces with pre-trained models from dlib or OpenCV.
  2. Transform the face for the neural network. This repository uses dlib’s real-time pose estimation with an OpenCV affine transformation to try to make the eyes and bottom lip appear in the same location in each image.
  3. Use a deep neural network to represent (or embed) the face on a 128-dimensional unit hypersphere. The embedding is a generic representation for anybody’s face. Unlike other face representations, this embedding has the nice property that a larger distance between two face embeddings means the faces are likely not of the same person. This property makes clustering, similarity detection, and classification tasks easier than with other face recognition techniques, where the Euclidean distance between features is not meaningful.
  4. Apply your favorite clustering or classification techniques to the features to complete your recognition task. See below for examples of classification and similarity detection.

Figure 1. OpenFace workflow for a single input image. Source: https://cmusatyalab.github.io/openface/#overview
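The four steps above map almost directly onto the openface Python API. The following is a minimal sketch based on the demo scripts in the OpenFace repository; the model file paths are assumptions that you would point at your own copies of dlib’s landmark predictor and OpenFace’s pre-trained Torch network.

# Minimal sketch of the OpenFace pipeline described above.
# The model paths are assumptions -- adjust them for your installation.
import cv2
import openface

DLIB_PREDICTOR = "models/dlib/shape_predictor_68_face_landmarks.dat"  # assumed path
TORCH_MODEL = "models/openface/nn4.small2.v1.t7"                      # assumed path

align = openface.AlignDlib(DLIB_PREDICTOR)             # step 1: face detection + landmarks
net = openface.TorchNeuralNet(TORCH_MODEL, imgDim=96)  # step 3: embedding network

def get_embedding(image_path):
    """Return the 128-dimensional OpenFace embedding for the largest face in the image."""
    bgr = cv2.imread(image_path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)

    bb = align.getLargestFaceBoundingBox(rgb)           # step 1: detect the face
    if bb is None:
        raise ValueError("No face found in %s" % image_path)

    # Step 2: affine-align so eyes and nose land in consistent positions, cropped to 96x96.
    aligned = align.align(96, rgb, bb,
                          landmarkIndices=openface.AlignDlib.OUTER_EYES_AND_NOSE)

    # Step 3: single forward pass through the trained network -> 128-d embedding.
    return net.forward(aligned)

Calling get_embedding on two photos of the same person should produce vectors that sit close together on the hypersphere, which is exactly the property the rest of the pipeline relies on.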

From a high-level perspective, OpenFace uses Torch, a scientific computing framework, to do the training offline, meaning it is only done once by OpenFace and users don’t have to get their hands dirty training hundreds of thousands of images themselves. Those images are fed into a neural net for feature extraction based on Google’s FaceNet model. FaceNet relies on a triplet loss function to measure how well the neural net separates faces, and it is able to cluster faces because the resulting measurements lie on a hypersphere.

This trained neural net is later used in the Python implementation after new images are run through dlib’s face-detection model. Once the faces are normalized by an OpenCV affine transformation so they are all oriented the same way, they are sent through the trained neural net in a single forward pass. This produces 128-dimensional facial embeddings that can be used for classification and matching, or fed into a clustering algorithm for similarity detection.
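Because distance between embeddings is meaningful, matching two faces can be as simple as thresholding the squared Euclidean distance between them. The sketch below reuses the hypothetical get_embedding helper from the overview section; the 0.99 threshold is an illustrative starting point to tune on your own data, not a value prescribed by OpenFace.

import numpy as np

def same_person(image_a, image_b, threshold=0.99):
    # get_embedding is the hypothetical helper from the overview sketch.
    emb_a = get_embedding(image_a)
    emb_b = get_embedding(image_b)
    # Squared Euclidean distance between the two 128-d embeddings:
    # small distance -> likely the same person, large -> likely different people.
    distance = np.sum((emb_a - emb_b) ** 2)
    return distance < threshold, distance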

Training

During the training portion of the OpenFace pipeline, 500,000 images are passed through the neural net. These images come from two public datasets: CASIA-WebFace, which comprises 494,414 images of 10,575 individuals, and FaceScrub, which contains 106,863 images of 530 individuals.

The point of training the neural net on all these images ahead of time is that training on 500,000 images to obtain the needed facial embeddings would not be feasible on mobile or in any other real-time scenario. Remember, this portion of the pipeline is only done once: OpenFace trains on these images to produce a network that maps any face to 128 embedding values, which are later used in the on-the-fly Python part of the pipeline. Then, instead of matching an image in high-dimensional space, you are only working with low-dimensional data, which helps make this model fast.

As mentioned before, OpenFace uses Google’s FaceNet architecture for feature extraction and uses a triplet loss function to measure how well the neural net separates faces. It does this by training on three images at a time: a known face image called the anchor, a second image of that same person, which provides the positive embedding, and an image of a different person, which provides the negative embedding.
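In other words, the network is pushed to place the anchor and positive close together and the anchor and negative far apart, with at least some margin between the two distances. A toy version of that objective, with an illustrative margin rather than OpenFace’s actual training setting, looks like this:

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor, positive, negative: embeddings on the unit hypersphere.
    pos_dist = np.sum((anchor - positive) ** 2)  # distance to the same person
    neg_dist = np.sum((anchor - negative) ** 2)  # distance to a different person
    # The loss reaches zero once the negative is at least `margin`
    # farther from the anchor than the positive is.
    return max(0.0, pos_dist - neg_dist + margin)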

The great thing about training on triplets is that the embeddings are measured on a unit hypersphere, where Euclidean distance determines which images are closer together and which are farther apart. The negative image’s embedding ends up farther from the anchor and positive embeddings, while those two end up close to each other. This is important because it allows clustering algorithms to be used for similarity detection. You might want to use a clustering algorithm if, for example, you wanted to detect family members on a genealogy site, or group faces on social media for possible marketing campaigns.
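For instance, a similarity-detection pass could feed the embeddings of a batch of photos into an off-the-shelf clustering algorithm. The sketch below uses scikit-learn’s DBSCAN together with the hypothetical get_embedding helper from earlier; the eps value is an assumption to tune for your data.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(image_paths, eps=0.9):
    # get_embedding is the hypothetical helper from the overview sketch.
    embeddings = np.array([get_embedding(p) for p in image_paths])
    # Faces whose embeddings sit close together on the hypersphere end up in
    # the same cluster; a label of -1 means the face matched nobody else.
    labels = DBSCAN(eps=eps, min_samples=2, metric="euclidean").fit_predict(embeddings)
    return dict(zip(image_paths, labels))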

Pre-processing


Figure 2. Source: https://alitarhini.wordpress.com/2010/12/05/face-recognition-an-introduction/

Along with finding each face in an image, facial recognition involves pre-processing the images: handling problems such as inconsistent or poor lighting, converting images to grayscale for faster training, and normalizing the position of the face.

While some facial recognition models handle these issues by training on massive datasets, OpenFace uses dlib’s landmark detector together with an OpenCV 2D affine transformation, which rotates the face so that the positions of the eyes, nose, and mouth are consistent across images. dlib detects 68 facial landmarks, the distances between those points are measured and compared to the points found in an average face image, and the image is then rotated and transformed based on those points to normalize the face for comparison, and finally cropped to 96×96 pixels as input to the trained neural net.
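The sketch below shows the idea using dlib and OpenCV directly: detect the 68 landmarks, pick the outer eye corners and nose tip, and warp them onto fixed target positions in a 96×96 crop. The target coordinates here are illustrative placeholders, not the exact template OpenFace ships with, and the predictor path is an assumption.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed path

# Hypothetical target positions (in a 96x96 crop) for the left outer eye
# corner, right outer eye corner, and nose tip.
TEMPLATE = np.float32([[18, 25], [78, 25], [48, 60]])

def align_face(rgb_image, size=96):
    faces = detector(rgb_image, 1)
    if not faces:
        return None
    landmarks = predictor(rgb_image, faces[0])
    # Indices 36, 45, and 33 are the outer eye corners and nose tip in
    # dlib's 68-point landmark scheme.
    src = np.float32([[landmarks.part(i).x, landmarks.part(i).y] for i in (36, 45, 33)])
    warp = cv2.getAffineTransform(src, TEMPLATE)
    # Rotate/scale the face so those three points land on the template, crop to 96x96.
    return cv2.warpAffine(rgb_image, warp, (size, size))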

Classification

So, after we isolate the face from the background and pre-process it using dlib and OpenCV, we can pass the image into the trained neural net that was created in the Torch portion of the pipeline. This step is a single forward pass through the neural net that produces the 128 embedding values (facial features) used in prediction. These low-dimensional facial embeddings are then used in classification or clustering algorithms.

For classification tests, OpenFace uses a linear support-vector machine (SVM), a technique commonly used in practice to match image features. An impressive thing about OpenFace is that, at this point, classifying an image takes only a few milliseconds.
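A sketch of that classification step, using scikit-learn’s linear SVM on top of the hypothetical get_embedding helper from earlier, might look like the following; the labels and image paths are placeholders.

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

def train_classifier(image_paths, person_names):
    # 128-d embeddings are the features; person names are the labels.
    X = np.array([get_embedding(p) for p in image_paths])
    encoder = LabelEncoder()
    y = encoder.fit_transform(person_names)
    clf = SVC(kernel="linear", probability=True)
    clf.fit(X, y)
    return clf, encoder

def predict(clf, encoder, image_path):
    # One forward pass for the embedding, then a fast linear SVM prediction.
    probs = clf.predict_proba([get_embedding(image_path)])[0]
    best = int(np.argmax(probs))
    return encoder.inverse_transform([best])[0], probs[best]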

Written By
Amol Jagdambe