Hand Gesture Recognition using Leap-Motion Controller
Last Name: SINGH
First Name: JASMEET
UIN: 523005618
What is the Leap Motion Controller and what can we do with it?
At TamuHack, a recently organized hackathon at Texas A&M University, I came across an interesting controller called the Leap Motion. I was instantly impressed by its capabilities and used it for my hack, which tracked the position of a fingertip and drew its trail. This could be used as a drawing pen or a signature input device.
So what is the Leap Motion? It is a small device that connects to a computer via a USB cable and tracks the positions of your hands and fingers without any contact. Here are a couple of pictures to make things clear; the picture on the right shows the axis convention used by the controller.
All you have to do is place your hand above the controller, and it gives you raw data consisting of 3-D coordinates and directions for all of your hand and finger joints. The device contains two small cameras, about 3 cm apart, and an internal controller that processes the images and computes the coordinates. To give you a sense of what goes on, I took this picture with the Leap Motion visualizer app running on my computer. You can see my hands and the equivalent bone structure; the spherical points are the joints whose coordinates the sensor provides.
![Visual illustration of Leap Motion data]()
The picture above is just a visualization of the raw data; the raw data itself looks like this:
![Raw data received from the Leap Motion Controller]()
That's a lot of information for one hand, and the interesting part is that the controller can compute coordinates for up to 4 hands simultaneously at up to 200 fps.
If you look closely at the raw data, it provides coordinates and direction vectors for the center of the hand, the wrist, and the thumb, index, middle, ring, and pinky fingers. It breaks each finger into four bones, matching the anatomy of our fingers: metacarpal, proximal, intermediate, and distal, and gives the start and end coordinates of each bone. The picture below shows these bones in a hand:
![Bone structure of the hand with bone names]()
Now, say I want to get the coordinates of the fingertips: I can simply look at the end coordinates of the distal bones. The accuracy claimed by the manufacturer is 0.1 mm, which is very high compared to a similar sensor from Microsoft, the Kinect, which can barely differentiate between fingers. The Kinect is better suited to body-sized gestures than to intricate hand and finger gestures.
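For readers who want to try this themselves, here is a minimal sketch of reading the palm and fingertip coordinates, assuming the Leap Motion v2 Python SDK (the exact class and property names may differ in other SDK versions):

```python
# Minimal sketch: read palm and fingertip coordinates from the controller.
# Assumes the Leap Motion v2 Python bindings (import Leap); coordinates are
# in millimetres relative to the controller.
import Leap

controller = Leap.Controller()
frame = controller.frame()              # the most recent frame of tracking data

for hand in frame.hands:
    palm = hand.palm_position           # a Leap.Vector with x, y, z fields
    print("palm: (%.1f, %.1f, %.1f)" % (palm.x, palm.y, palm.z))
    for finger in hand.fingers:
        # The fingertip is the far end ("next joint") of the distal bone.
        tip = finger.bone(Leap.Bone.TYPE_DISTAL).next_joint
        print("tip:  (%.1f, %.1f, %.1f)" % (tip.x, tip.y, tip.z))
```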
Motivation
That is a lot of data about our hands, and I wanted to make good use of it. That's how I got the idea of building a gesture recognition tool for this controller. The applications are immense and could change the way we interact with our computers. I decided to build a tool that learns gestures and produces a model which can be used to classify hand gestures online or offline. I will compare the performance of different types of classifiers and show how selecting different feature sets from the raw data can improve their performance. I am only looking at static hand gestures in this project.
Gesture Recognition? It isn't new!
Hand gesture recognition is a well-researched topic, and there are practical implementations that classify simple hand gestures with very high accuracy. A lot of different input methods have also been researched, for example hand gesture recognition from 2-D images or from 3-D mesh models. All of these input methods require a lot of pre-processing before the data can actually be fed to a classifier. With 3-D mesh models of hands, the pre-processing step can be very computationally expensive, so they are rarely used for online classification applications.
Classification using images works well only if the images have very little noise and the hand can be distinctly segmented from the background. Moreover, even a small 16x16 pixel image produces 256-dimensional data to be fed to the classifier.
And here we have a simple, neat controller that gives the coordinates of all fingertips and the palm in (x, y, z) with very high accuracy. If I can achieve high accuracy for, say, 10 distinct gestures, it can easily be used for online applications (like games). The classifier should be very fast, as the dimensionality of the input data is small and the data itself is very accurate. Moreover, the controller gives coordinates in 3-D, which increases its capability to distinguish more gestures than classification using 2-D images.
Let's get into it!
Leap Motion provides SDKs in various languages (C++, Python, Java) to extract raw data from the controller. I am using the Python implementation, and I will be using the scikit-learn Python module for pre-built classifiers and to test their performance. Following are the captured images of the 10 gestures that I am trying to classify:
I view gesture recognition as a supervised learning problem, in which we provide a set of training examples and then test the accuracy of the classifier on a new set of testing examples. The raw data contains a lot of information that might not be needed at all; for example, we can classify the above gestures based only on the fingertip positions. So which features to choose depends purely on the application. In this case I started with a very small feature set from the raw data: I separate out the (x, y, z) coordinates of the palm and all 5 fingertips, which reduces the dimensionality of the data to 18. So the basic flow is as follows:
Leap Motion --> Pre-Processing --> Classifier --> Recognized Gesture
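As a concrete sketch of this pre-processing step (the helper name `extract_features` is my own, and the Leap v2 Python SDK is assumed), the 18-dimensional vector can be built like this:

```python
# Sketch of Feature Set 1: palm centre plus the 5 fingertips, flattened into
# an 18-dimensional list. extract_features() is an illustrative helper name,
# not part of the Leap SDK.
import Leap

def extract_features(hand):
    """Return [palm_x, palm_y, palm_z, tip1_x, tip1_y, tip1_z, ...] (length 18)."""
    palm = hand.palm_position
    features = [palm.x, palm.y, palm.z]
    for finger in hand.fingers:                        # thumb through pinky
        tip = finger.bone(Leap.Bone.TYPE_DISTAL).next_joint
        features.extend([tip.x, tip.y, tip.z])
    return features
```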
Capturing Data Set
The Leap Motion controller could capture data at a rate of 100 fps on my computer. I captured data continuously for 10 seconds for each gesture while continuously varying my hand's position, rotation, pitch and yaw. This gives a highly varied set of 1000 samples for each gesture. I captured training and testing data separately; the testing data was varied much more than the training data in terms of hand position, rotation, pitch and yaw. The same data was collected for 3 different people, and all of it was merged into a single training dataset and a single testing dataset. The dataset contains data only for the right hand; a similar dataset can be built for the left hand as well. The dataset is available for open use along with the source code (link given in the last section).
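A rough sketch of this capture loop might look like the following (the file name and CSV layout are illustrative choices, not a fixed format, and the `extract_features` helper is the one sketched earlier):

```python
# Sketch of the capture loop: sample the controller for ~10 seconds and label
# every sample with the current gesture id. Uses the extract_features() helper
# sketched earlier; file name and CSV layout are illustrative only.
import csv
import time
import Leap

def capture_gesture(gesture_id, seconds=10, out_file="train_gestures.csv"):
    controller = Leap.Controller()
    time.sleep(1.0)                          # give the controller time to connect
    end_time = time.time() + seconds
    with open(out_file, "a") as f:
        writer = csv.writer(f)
        while time.time() < end_time:
            frame = controller.frame()
            if not frame.hands.is_empty:
                hand = frame.hands[0]        # this dataset uses the right hand only
                writer.writerow([gesture_id] + extract_features(hand))
            time.sleep(0.01)                 # roughly 100 samples per second
```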
Classifiers
I will perform tests using the following two classifiers provided by the scikit-learn Python library:
1) k-Nearest Neighbors classifier (kNN): In kNN classification, the input consists of the k closest training examples in the feature space and the output is a class membership. An object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor. (source: Wikipedia)

kNN classifiers require a large number of training examples when the dimensionality is high, a phenomenon known as the curse of dimensionality.
2) Support Vector Machine classifier (SVM): Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build a plane maximizing the margin between the two classes. Classes are not always linearly separable in feature space, so the solution is to build a decision function that is not linear but may, for example, be polynomial. This is done using the kernel trick, which can be seen as creating a decision energy by positioning kernels on observations. (source: scikit-learn documentation)

For example, we could use a linear kernel, a higher-degree polynomial kernel, or an RBF (Radial Basis Function) kernel. The image below shows how these kernels can separate two or more classes (from left to right: linear, 3rd-degree polynomial, RBF); a short scikit-learn sketch of both classifiers follows the figure.
![1) Linear kernel, 2) Polynomial kernel, 3) RBF kernel (source: scikit-learn documentation)]()
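Here is a short sketch of how both classifiers can be trained and compared with scikit-learn. The CSV file names and layout (gesture id in the first column, 18 features after it) follow the capture sketch above and are assumptions, not fixed names:

```python
# Sketch: train the kNN classifier and the SVM variants compared below on the
# captured feature vectors. File names and CSV layout follow the earlier
# capture sketch and are illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

train = np.loadtxt("train_gestures.csv", delimiter=",")
test = np.loadtxt("test_gestures.csv", delimiter=",")
X_train, y_train = train[:, 1:], train[:, 0]
X_test, y_test = test[:, 1:], test[:, 0]

classifiers = {
    "kNN (k=5)":           KNeighborsClassifier(n_neighbors=5),
    "SVM (Linear)":        SVC(kernel="linear"),
    "SVM (Poly degree 3)": SVC(kernel="poly", degree=3),
    "SVM (Poly degree 5)": SVC(kernel="poly", degree=5),
    "SVM (RBF)":           SVC(kernel="rbf"),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print("%s accuracy: %.4f" % (name, clf.score(X_test, y_test)))
```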
Pre-Processing, Feature Set Selection
In the pre-processing step, the raw data is reduced to a set of desired features that can uniquely differentiate between the gestures. The idea behind reducing the feature set is that the classifier is fed only the important information, which results in much better models. The better the feature set, the faster the classification will be. So a small amount of computation at the pre-processing step can result in a significant increase in classifier accuracy. Below I discuss three different types of pre-processing and compare their capabilities.
1) Feature Set 1: As indicated in the previous sections, I started by reducing the raw data to a smaller feature set consisting of the (x, y, z) coordinates of the palm and the 5 fingertips. This feature set doesn't take into account the position of the hand with respect to the controller, so the same gesture can result in a completely different set of coordinates. It should therefore suffer a lot from the variety of hand positions in my testing dataset. Variation in hand sizes across the testing dataset will also affect its performance.
2) Feature Set 2: In this feature set I shift the origin to the palm of the hand, so all other coordinates are relative to the center of the hand (the palm) rather than to the Leap Motion controller. This should improve on Feature Set 1: even if the testing data contains a familiar gesture with shifted coordinates, this feature set will still be able to classify it.
But this feature set still doesn't take into account the rotation of the hand in terms of its pitch, roll and yaw angles. The same gesture would result in completely different coordinate values for a different orientation of the hand.
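Before moving on, here is a sketch of the palm-relative pre-processing step (helper name again illustrative, Leap v2 SDK assumed):

```python
# Sketch of Feature Set 2: express each fingertip relative to the palm centre,
# so the hand's absolute position above the controller no longer matters.
import Leap

def extract_features_palm_relative(hand):
    palm = hand.palm_position
    features = []
    for finger in hand.fingers:
        tip = finger.bone(Leap.Bone.TYPE_DISTAL).next_joint
        # The palm itself becomes the origin (0, 0, 0), so only the 5 relative
        # fingertip positions need to be kept.
        features.extend([tip.x - palm.x, tip.y - palm.y, tip.z - palm.z])
    return features
```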
If you are totally unaware of what pitch, roll and yaw are, you should not be surprised, because these notions are usually used in the context of an airplane. Here is how these angles look for an airplane:
![Yaw, pitch and roll in the context of an airplane]()
A similar concept applies here. The three pictures below show what these angles mean in the context of the Leap Motion controller.
![Pitch angle with respect to the negative z axis and positive y axis]()

![Roll angle with respect to the positive x axis and positive y axis]()

![Yaw angle with respect to the negative z axis and positive x axis]()
The pitch, roll and yaw angles with respect to the Leap Motion x, y and z axes are included in the raw data from the controller. So, if all the fingertip coordinates are compensated for the pitch, roll and yaw angles, we always get coordinates with the origin at the palm and zero pitch, roll and yaw.
3) Feature Set 3: This feature set is an extension of Feature Set 2, with the pitch, roll and yaw angles compensated. The conversion is fairly simple and can be done by rotating the involved axes by the negative of the pitch, roll or yaw angle.
If a rectangular xy-coordinate system is rotated through an angle θ to form an x'y'-coordinate system, then a point P with coordinates (x, y) in the old system has coordinates (x', y') in the new system, where the two are related by (source: http://math.sci.ccny.cuny.edu/document/show/2685):
x = x' cos θ − y' sin θ and y = x' sin θ + y' cos θ.
and
x' = x cos θ + y sin θ and y' = −x sin θ + y cos θ.
This feature set results in a very accurate representation of the hand coordinates even if your hand is rotated into any orientation with respect to the Leap Motion controller.
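Here is a sketch of this orientation compensation, assuming the v2 SDK conventions (pitch and yaw read from `hand.direction`, roll from `hand.palm_normal`); the order in which the three rotations are undone is my own choice for illustration:

```python
# Sketch of Feature Set 3: undo the hand's pitch, roll and yaw so that the
# palm-relative fingertip coordinates are orientation-independent. Pitch/yaw
# are read from hand.direction and roll from hand.palm_normal (v2 SDK); the
# rotation order below is an illustrative choice.
import numpy as np
import Leap

def rot_x(a):   # rotation about the x axis (used to undo pitch)
    return np.array([[1, 0, 0],
                     [0, np.cos(a), -np.sin(a)],
                     [0, np.sin(a),  np.cos(a)]])

def rot_y(a):   # rotation about the y axis (used to undo yaw)
    return np.array([[ np.cos(a), 0, np.sin(a)],
                     [0, 1, 0],
                     [-np.sin(a), 0, np.cos(a)]])

def rot_z(a):   # rotation about the z axis (used to undo roll)
    return np.array([[np.cos(a), -np.sin(a), 0],
                     [np.sin(a),  np.cos(a), 0],
                     [0, 0, 1]])

def extract_features_orientation_free(hand):
    pitch = hand.direction.pitch
    yaw = hand.direction.yaw
    roll = hand.palm_normal.roll
    # Rotating through the negative angles brings the hand back to zero
    # pitch, roll and yaw, exactly as described above.
    undo = rot_x(-pitch).dot(rot_y(-yaw)).dot(rot_z(-roll))
    palm = hand.palm_position
    features = []
    for finger in hand.fingers:
        tip = finger.bone(Leap.Bone.TYPE_DISTAL).next_joint
        rel = np.array([tip.x - palm.x, tip.y - palm.y, tip.z - palm.z])
        features.extend(undo.dot(rel))
    return features
```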
Performance Evaluation
The kNN and SVM classifiers were trained on the training dataset that I captured and then tested on the testing dataset. I evaluate performance as the percentage of correct predictions. The following table compares the kNN classifier with SVM classifiers using different kernels: linear, 3rd-degree polynomial, 5th-degree polynomial, and RBF. The same experiment was carried out with each of the three feature sets discussed above. The table below shows the percentage of correct predictions.
| Accuracy (%) | kNN | SVM (Linear) | SVM (Poly Degree 3) | SVM (Poly Degree 5) | SVM (RBF) |
|---|---|---|---|---|---|
| Feature Set 1 | 72.2695 | 76.7552 | 78.4178 | 80.9594 | 48.2315 |
| Feature Set 2 | 85.9925 | 84.6542 | 85.2349 | 86.0159 | 58.1259 |
| Feature Set 3 | 93.8432 | 95.8923 | 96.1193 | 96.5792 | 67.4786 |
I further investigated how the accuracy varies with the degree of the polynomial kernel used in the SVM classifier. The accuracy peaked as the degree approached 5 or 6, and started to drop beyond degree 6. The table below summarizes the results:
| Accuracy (%) | SVM (Linear) | SVM (Poly, Degree 2) | SVM (Poly, Degree 3) | SVM (Poly, Degree 4) | SVM (Poly, Degree 5) | SVM (Poly, Degree 6) | SVM (Poly, Degree 7) | SVM (Poly, Degree 8) |
|---|---|---|---|---|---|---|---|---|
| Feature Set 1 | 76.7552 | 77.2948 | 78.4178 | 79.5784 | 80.9594 | 81.0143 | 79.9932 | 79.7322 |
| Feature Set 2 | 84.6542 | 84.9023 | 85.2349 | 85.8932 | 86.0159 | 86.2334 | 86.0005 | 85.4543 |
| Feature Set 3 | 95.8923 | 96.0089 | 96.1193 | 96.3423 | 96.5792 | 96.8902 | 96.1023 | 95.6783 |
These observations match what theory predicts: prediction accuracy increases up to a certain polynomial degree and then decreases as the degree is increased further, likely because the higher-degree models start to overfit the training data.
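The degree sweep itself is a small loop over the same data arrays used in the earlier scikit-learn sketch:

```python
# Sketch of the polynomial-degree sweep summarised in the table above, reusing
# the X_train / y_train / X_test / y_test arrays from the earlier sketch.
from sklearn.svm import SVC

for degree in range(2, 9):                   # degrees 2 through 8
    clf = SVC(kernel="poly", degree=degree)
    clf.fit(X_train, y_train)
    print("poly degree %d: %.4f" % (degree, clf.score(X_test, y_test)))
```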
All classifiers trained in under 1 second on the training set of 10,000 samples, except the SVM with the RBF kernel, which took around 55 seconds. Prediction time was also well under a second for the roughly 10,000 testing samples, again except for the SVM with the RBF kernel, which took around 30 seconds to complete all predictions.
Conclusion
The prediction accuracy of the SVM classifiers was above 95% using Feature Set 3. That level of accuracy is quite usable, but keep in mind these numbers are for only 10 gestures. Image-based classifiers can perform better than this for 10 gestures, but we have to take into account that image classification works in 2-D whereas this works in 3-D. Accuracy can be further improved with even better feature sets; a group of Leap Motion hobbyists used the angles between fingers as their feature set and achieved 99% accuracy for 10 gestures.
A lot of the accuracy loss can also be attributed to noise in the data collection, since I collected samples continuously at the rate they were generated. A better approach would be to wait for a sample to stabilize to a particular value before recording it. This procedure would take much longer for collecting a large, varied dataset, but it would further improve the accuracy by a good margin.
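One possible way to implement this, as a rough sketch (the window size and threshold are guesses, not tuned values): keep the last few feature vectors and only record a sample once they have stopped changing.

```python
# Rough sketch of the "wait for the sample to stabilise" idea: only record a
# sample when the feature vector has barely moved over the last few frames.
# The threshold is an illustrative guess.
import numpy as np

def is_stable(recent_features, threshold_mm=2.0):
    """recent_features: list of the last few feature vectors (one per frame)."""
    window = np.array(recent_features)
    # Stable if no individual feature moved more than threshold_mm across the window.
    return (window.max(axis=0) - window.min(axis=0)).max() < threshold_mm
```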
Future Work
I would really like to keep improving the classifier and aim to release a library that other people can use. I am releasing the code and datasets used in this project for experimentation.
Source Code and DataSets: https://www.dropbox.com/s/ulzd6mg77pbq5y7/aiproject.zip?dl=0