Sreehari


Weekend project: sign language and static-gesture recognition using scikit-learn

Let’s build a machine learning pipeline that can read the sign language alphabet just by looking at a raw image of a person’s hand.


This problem has two parts to it:


  1. Building a static-gesture recognizer, which is a multi-class classifier that predicts the static sign language gestures.

  2. Locating the hand in the raw image and feeding this section of the image to the static gesture recognizer (the multi-class classifier).


My example code and dataset for this project are available in the GitHub repository.


First, some background.

Gesture recognition is an open problem in the area of machine vision, a field of computer science that enables systems to emulate human vision. Gesture recognition has many applications in improving human-computer interaction, and one of them is in the field of Sign Language Translation, wherein a video sequence of symbolic hand gestures is translated into natural language.


A range of advanced methods for gesture recognition have been developed. Here, we’ll look at how to perform static-gesture recognition using the scikit-learn and scikit-image libraries.


Part 1: Building a static-gesture recognizer

For this part, we use a data set comprising raw images and a corresponding CSV file with coordinates indicating the bounding box for the hand in each image.


This data set is organized by user, and its directory structure is as follows. Each image name indicates the letter of the alphabet that the image represents.


dataset
   |----user_1
          |---A0.jpg
          |---A1.jpg
          |---A2.jpg
          |---...
          |---Y9.jpg
   |----user_2
          |---A0.jpg
          |---A1.jpg
          |---A2.jpg
          |---...
          |---Y9.jpg
   |---- ...
   |---- ...

The static-gesture recognizer is essentially a multi-class classifier that is trained on input images representing the 24 static sign-language gestures (A-Y, excluding J).


Building a static-gesture recognizer using the raw images and the csv file is fairly simple.


To use the multi-class classifiers from the scikit-learn library, we first need to build the data set: every image has to be converted into a feature vector (X), and every image gets a label corresponding to the sign language letter that it denotes (Y).


The key now is to use an appropriate strategy to vectorize the image and extract meaningful information to feed to the classifier. Simply using the raw pixel values will not work well if we plan on using simple multi-class classifiers (as opposed to convolutional networks).


To vectorize our images, we use the Histogram of Oriented Gradients (HOG) approach, as it has been shown to yield good results on problems such as this one. Other feature extractors that can be used include Local Binary Patterns and Haar-like features.
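To make the idea behind HOG concrete, here is a minimal NumPy sketch of its core step: a magnitude-weighted histogram of gradient orientations for a single cell. It is a simplified illustration and not scikit-image's `hog`, which also splits the image into cells and normalizes over blocks. The function name `orientation_histogram` and the choice of 9 bins are assumptions made for this example.

```python
import numpy as np

def orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 degrees)."""
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    gy, gx = np.gradient(cell.astype(float))
    magnitude = np.hypot(gx, gy)
    # Fold directions into [0, 180): opposite gradients count as the same orientation.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(orientation, bins=n_bins, range=(0.0, 180.0),
                           weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A horizontal intensity ramp: every gradient points along the x axis,
# so all of the weight should land in the first orientation bin.
ramp = np.tile(np.arange(8.0), (8, 1))
features = orientation_histogram(ramp)
```

The resulting vector describes the dominant edge directions in the patch rather than its raw pixel values, which is what makes it a more useful input for a simple classifier.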


Code:

We use pandas in the get_data() function to load the CSV file. Two functions, crop() and convertToGrayToHOG(), are used to get the required HOG vector and append it to the list of vectors that we’re building, in order to train the multi-class classifier.


import pandas as pd
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize

# returns the HOG vector of a particular image
def convertToGrayToHOG(imgVector):
    grayImage = rgb2gray(imgVector)
    return hog(grayImage)

# returns the cropped image, resized to scale x scale pixels
def crop(img, x1, x2, y1, y2, scale):
    crp = img[y1:y2, x1:x2]
    crp = resize(crp, (scale, scale))
    return crp

# loads data for multi-class classification
def get_data(user_list, img_dict, data_directory):
    X = []
    Y = []

    for user in user_list:
        boundingbox_df = pd.read_csv(data_directory + user + '/' + user + '_loc.csv')

        for _, row in boundingbox_df.iterrows():
            cropped_img = crop(img_dict[row['image']],
                               row['top_left_x'],
                               row['bottom_right_x'],
                               row['top_left_y'],
                               row['bottom_right_y'],
                               128)
            hogvector = convertToGrayToHOG(cropped_img)

            X.append(hogvector.tolist())
            # the label is the first character of the file name, e.g. 'A' in 'user_1/A0.jpg'
            Y.append(row['image'].split('/')[1][0])

    return X, Y

The next step is to encode the output labels (the Y-values) to numerical values. We do this using sklearn’s label encoder.


In our code, we have done this as follows:


Y_mul = self.label_encoder.fit_transform(Y_mul)

where the label_encoder object is constructed as follows within the gesture-recognizer class constructor:


self.label_encoder = LabelEncoder().fit(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y'])
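As a small standalone illustration of what the encoder does, the sketch below fits a `LabelEncoder` on the same 24 letters and converts a few labels to integers and back. The variable names are chosen for this example only.

```python
from sklearn.preprocessing import LabelEncoder

# The 24 static gestures: A-Y without J.
letters = list("ABCDEFGHIKLMNOPQRSTUVWXY")
encoder = LabelEncoder().fit(letters)

# Letters map to their position in the sorted class list.
encoded = encoder.transform(["A", "K", "Y"])
# inverse_transform turns predicted integers back into letters.
decoded = encoder.inverse_transform(encoded)
```

Because the encoder is fitted on the full alphabet up front, calling `transform` alone keeps the letter-to-integer mapping stable even when a batch of labels does not contain every letter.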

Once this is done, the model can be trained using any multi-class classification algorithm of your choice from the scikit-learn toolbox. We trained ours using a support vector classifier (SVC) with a linear kernel.


Training a model using sklearn does not involve more than two lines of code. Here’s how you do it:


svcmodel = SVC(kernel='linear', C=0.9, probability=True)
self.signDetector = svcmodel.fit(X_mul, Y_mul)

The hyperparameters (C=0.9 in this case) can be tuned using a grid search.
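A grid search over C might look like the following sketch. It uses scikit-learn's `GridSearchCV` on synthetic data standing in for the HOG vectors; the candidate values in `param_grid` and the data set sizes are assumptions made for illustration, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the HOG vectors (X) and the encoded labels (Y).
X, Y = make_classification(n_samples=120, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# Candidate values for C (illustrative only).
param_grid = {"C": [0.1, 0.5, 0.9, 1.0, 10.0]}

# Each candidate is evaluated with 3-fold cross-validation.
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=3)
search.fit(X, Y)

best_C = search.best_params_["C"]
best_score = search.best_score_
```

After fitting, `search.best_estimator_` is a model refitted on all of the data with the best value of C, so it can be used directly in place of the hand-tuned classifier.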


In this case, we do not know a whole lot about the data as such (i.e., the HOG vectors). So it’s a good idea to also try algorithms like XGBoost (Extreme Gradient Boosting) or Random Forest classifiers and see how they perform.
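One simple way to compare candidate algorithms is cross-validation on the same data. The sketch below does this for a linear SVC and a Random Forest on synthetic data; the data and the model settings are assumptions for illustration, and the scores say nothing about how these models perform on the real HOG vectors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the HOG vectors and encoded labels.
X, Y = make_classification(n_samples=150, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

candidates = {
    "linear SVC": SVC(kernel="linear", C=0.9),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 3-fold cross-validated accuracy for each candidate.
scores = {name: cross_val_score(model, X, Y, cv=3).mean()
          for name, model in candidates.items()}
```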


Part 2: Building the Localizer

This part requires slightly more effort than the first.


Broadly, we’ll employ the following steps in completing this task.


  1. Build a data set comprising images of hands and parts that are not-hand, using the given data set and the bounding box values for each image.


  2. Train a binary classifier to detect hand/not-hand images using the above data set.


  3. (Optional) Use Hard Negative Mining to improve the classifier.


  4. Use a sliding window at various scales on the query image to isolate the region of interest.


Here, we are not going to use any image processing techniques like filtering or color segmentation. The scikit-image library is used only to read, crop, scale, and convert images to grayscale, and to extract HOG vectors.


Building the hand/not-hand dataset:

The data set could be built using any strategy you like. One way to do this is to generate random coordinates and then check the ratio of the area of intersection to the area of union (i.e., the degree of overlap with the given bounding box) to determine whether the region is a non-hand section. (Another approach would be to use a sliding window to determine the coordinates, but this is horribly slow and unnecessary.)
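The overlap check can be sketched as a small intersection-over-union function. This is an illustrative version, not the project's own overlap helper: the name `intersection_over_union` and the `(x, y, width, height)` box format are assumptions made for this example.

```python
def intersection_over_union(box_a, box_b):
    """Boxes are (x, y, width, height), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

hand_box = (100, 60, 128, 128)
same = intersection_over_union(hand_box, hand_box)                # identical boxes
shifted = intersection_over_union(hand_box, (164, 60, 128, 128))  # half-width shift
disjoint = intersection_over_union(hand_box, (0, 0, 50, 50))      # no overlap
```

A randomly generated box whose ratio with the annotated hand box falls below a chosen threshold (0.5 in the code that follows) can then be treated as a not-hand sample.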


import random

def buildhandnothand_lis(frame, imgset):
    """Randomly generate bounding boxes and return the HOG vectors of the
    cropped boxes along with their labels (1 if hand, 0 otherwise)."""
    poslis = []
    neglis = []

    for nameimg in frame.image:
        tupl = frame[frame['image'] == nameimg].values[0]
        x_tl = tupl[1]
        y_tl = tupl[2]
        side = tupl[5]
        conf = 0

        arg1 = [x_tl, y_tl, conf, side, side]
        # crop() as defined in Part 1 needs a target size; 128 is assumed here
        # to match the 128x128 window used elsewhere
        poslis.append(convertToGrayToHOG(
            crop(imgset[nameimg], x_tl, x_tl + side, y_tl, y_tl + side, 128)))

        # sample random boxes (images are 320x240) until one overlaps the
        # hand box by at most 0.5, and keep it as the not-hand sample
        while True:
            x = random.randint(0, 320 - side)
            y = random.randint(0, 240 - side)
            crp = crop(imgset[nameimg], x, x + side, y, y + side, 128)
            hogv = convertToGrayToHOG(crp)
            arg2 = [x, y, conf, side, side]

            # overlapping_area() is defined elsewhere in the project code
            z = overlapping_area(arg1, arg2)
            if z <= 0.5:
                neglis.append(hogv)
                break

    label_1 = [1] * len(poslis)
    label_0 = [0] * len(neglis)
    label_1.extend(label_0)
    poslis.extend(neglis)

    return poslis, label_1

Training a binary classifier:

Once the data set is ready, training the classifier can be done exactly as seen before in part 1.


Usually, in this case, a technique called hard negative mining is employed to reduce the number of false-positive detections and improve the classifier. One or two iterations of hard negative mining with a Random Forest classifier are enough to bring the classifier to an acceptable classification accuracy, which in this case is anything above 80%.
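The idea of hard negative mining can be sketched as follows: train the classifier, run it on extra background samples, add the ones it wrongly labels as hands to the training set, and retrain. The sketch uses random vectors in place of real HOG features, and all sizes and parameters are assumptions for illustration rather than the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for HOG vectors: hands (label 1) and background (label 0).
X_pos = rng.normal(loc=1.0, size=(100, 16))
X_neg = rng.normal(loc=0.0, size=(100, 16))
# Extra background windows that are not part of the initial training set.
negative_pool = rng.normal(loc=0.3, size=(500, 16))

X_train = np.vstack([X_pos, X_neg])
y_train = np.array([1] * len(X_pos) + [0] * len(X_neg))

clf = RandomForestClassifier(n_estimators=50, random_state=0)
for _ in range(2):  # one or two mining rounds
    clf.fit(X_train, y_train)
    # Hard negatives: background windows the current model mistakes for hands.
    hard = negative_pool[clf.predict(negative_pool) == 1]
    if len(hard) == 0:
        break
    X_train = np.vstack([X_train, hard])
    y_train = np.concatenate([y_train, np.zeros(len(hard), dtype=int)])

# Final fit on the augmented training set.
clf.fit(X_train, y_train)
```

In the real pipeline, the negative pool would come from windows of the training images that do not overlap the annotated hand box.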




Detecting hands in test images:

Now, to actually use the above classifier, we scale the test image by various factors and then run a sliding window over each scaled copy to pick the window that best captures the region of interest. This is done by selecting the region with the maximum confidence score assigned by the binary (hand/not-hand) classifier across all scales.


The test images need to be scaled because we run a fixed-size window (128x128 in our case) across all images to pick the region of interest, and the region of interest may not fit perfectly into this window size.
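The multi-scale search can be sketched in plain Python as below. To keep the example self-contained, it works on image dimensions only and takes a scoring function as a parameter. In the real pipeline, that function would crop the window from the scaled image, compute its HOG vector, and return the hand/not-hand classifier's confidence. The names `sliding_windows`, `best_window`, and `toy_score`, the step of 32 pixels, and the scale factors are all assumptions made for illustration.

```python
def sliding_windows(width, height, window=128, step=32):
    """Yield the top-left (x, y) corner of every window that fits in the image."""
    for y in range(0, height - window + 1, step):
        for x in range(0, width - window + 1, step):
            yield x, y

def best_window(width, height, scales, score_fn, window=128, step=32):
    """Return (score, scale, x, y) for the highest-scoring window over all scales."""
    best = None
    for scale in scales:
        # Dimensions of the image after rescaling by this factor.
        scaled_w, scaled_h = int(width * scale), int(height * scale)
        for x, y in sliding_windows(scaled_w, scaled_h, window, step):
            score = score_fn(scale, x, y)
            if best is None or score > best[0]:
                best = (score, scale, x, y)
    return best

# Hypothetical scorer standing in for the classifier's confidence:
# it peaks for the window at (64, 32) on the unscaled image.
def toy_score(scale, x, y):
    return -abs(scale - 1.0) - abs(x - 64) / 100 - abs(y - 32) / 100

result = best_window(320, 240, scales=[0.75, 1.0, 1.25], score_fn=toy_score)
```

The coordinates returned are in the scaled image, so they need to be divided by the scale factor to map the window back onto the original image.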



Putting it all together

After both parts are complete, all that’s left to do is to call them in succession to get the final output when provided with a test image.


That is, given a test image, we first get the various detected regions across different scales of the image and pick the best one among them. This region is then cropped out, rescaled (to 128x128) and its corresponding hog vector is fed to the multi-class classifier (i.e., the gesture recognizer). The gesture recognizer then predicts the gesture denoted by the hand in the image.


Key points

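To summarize, this project involves the following steps, each covered above:

  1. Build the hand/not-hand data set from the raw images and their bounding boxes.

  2. Convert the cropped hand and not-hand images to HOG vectors.

  3. Train a binary classifier to locate the hand and a multi-class classifier to recognize the gesture.

  4. Run the two classifiers one after the other on a test image.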

I worked on this project with a teammate as part of the Machine Learning course that we took in college. A big shout-out to her for all her contributions!


Also, we wanted to mention , which is a wonderful blog that we used extensively while we were working on the project! Do check it out for image processing and OpenCV-related content.






