Motion Recognition via Image Differencing

This program computes the difference between successive images, analyzes the amount of change, and determines which of several pre-defined motions was performed.

// Compile with: g++ -o p1-Robert main.c vision.c -lm
// Run with: p1-Robert frame#-camera#.pgm frame#-camera#.pgm 1/0 (third argument: 1 if the data is from camera 1, 0 otherwise)
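For instance, comparing the camera-1 base frame against frame 2 from camera 1, with the camera1 flag set, would look something like this (an illustrative invocation, with filenames following the frame#-camera#.pgm convention):

// p1-Robert 0-1.pgm 2-1.pgm 1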

Video-Based Activity Recognition

The programming task of this assignment was to design and implement a program that recognizes activities in the Sensorium, the “smart room” at BU, using three webcams. The activities to be identified were:
1. Standing, then crouching, then standing again.
2. Standing and waving arms that are extended sideways.
3. Standing and waving arms that are extended forward.

My approach to this problem was to use the background-differencing techniques proposed by Bobick for MIT's smart room. I first converted all the ppm files from the webcam captures to pgm. The program takes consecutive frames (pgms) and figures out which pixels have changed by subtracting the pixel values of the first image from the second. Combined with some thresholding of the pixel values to get rid of noise, this results in an image where pixels that have stayed the same are white and pixels that have changed are shown as some grayscale value.
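A minimal sketch of that differencing step is below; the function name and the NOISE_THRESH value are my own illustrations, not necessarily what vision.c actually does (the thresholds I really measured are described in the Limitations section).

#include <stdlib.h>

#define WHITE        255
#define NOISE_THRESH 25   /* assumed placeholder: differences below this are noise */

/* Subtract image1 from image2 pixel by pixel.  Pixels that changed by
 * less than the threshold become white; changed pixels keep their
 * absolute difference as a grayscale value. */
void difference_images(const unsigned char *img1, const unsigned char *img2,
                       unsigned char *out, int width, int height)
{
    for (int i = 0; i < width * height; i++) {
        int diff = abs((int)img2[i] - (int)img1[i]);
        out[i] = (diff < NOISE_THRESH) ? WHITE : (unsigned char)diff;
    }
}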

Then, using this new image, the program detects the first and last zero crossings, i.e., where the image goes from white to non-white, in both the x and y directions, giving min(X,Y) and max(X,Y). From these we can compute the total area of the image that changed from one frame to the next; the basic idea is that certain activities will cause more of the image to change. Using the data gathered and the areas from the various resulting change-of-motion images, I computed some basic ratios of the area changes for each activity in order to recognize which was occurring.
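A rough sketch of that scan, assuming the cleaned difference image from above where unchanged pixels are white (255); the names here are illustrative:

#define WHITE 255

/* Find the first and last non-white pixels in x and y, then return
 * the area of the bounding box they define. */
int changed_area(const unsigned char *diff, int width, int height)
{
    int minX = width, maxX = -1, minY = height, maxY = -1;

    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            if (diff[y * width + x] != WHITE) {   /* a changed pixel */
                if (x < minX) minX = x;
                if (x > maxX) maxX = x;
                if (y < minY) minY = y;
                if (y > maxY) maxY = y;
            }

    if (maxX < 0)
        return 0;   /* nothing changed between the two frames */
    return (maxX - minX + 1) * (maxY - minY + 1);
}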

For example, crouching results in an area calculation that is less than that of someone waving their arms side to side, for cameras at a specific viewpoint. Here are some examples.

Setup:
The camera setup for this program was two cameras pointing directly at the subject and a third off to the side at a diagonal. Cameras 0 and 2 were facing the subject, and camera 1 was to the side.
Data Analysis:
The pictures for each pose consisted of 15 images, 5 frames from each camera, with files named frame#-camera#.pgm. Each pose was saved in a separate folder, and 4 poses were captured:
Standing, then crouching, then standing again.
Standing and waving arms that are extended sideways.
Standing and waving arms that are extended forward facing toward the cameras.
Standing and waving arms that are extended forward facing to the side of the cameras.

Because some of the images were unreliable, frames were captured at the wrong point in a motion, I did not know when frames were being captured, and the capture method itself was imprecise, some images were disregarded from the data set.

Here is a list of the data that was applicable to this project. Those not listed were unreliable either because of the camera angle or because of problems coordinating the time of frame capture with the movement.

In the tables below, *** means N/A.

Crouching          Camera0   Camera1   Camera2
Base Case:         0-0       0-1       0-2
Tested Against:    3-0       ***       2-2
                   ***       ***       3-2

Side Flap          Camera0   Camera1   Camera2
Base Case:         0-0       0-1       0-2
Tested Against:    ***       1-1       2-2
                   ***       2-1       3-2
                   ***       3-1       4-2
                   ***       4-1       ***

Front Flap         Camera0   Camera1   Camera2
Base Case:         0-0       0-1       0-2
Tested Against:    ***       1-1       ***
                   3-0       2-1       3-2
                   4-0       ***       4-2

Front Flap (side view)   Camera0   Camera1   Camera2
Base Case:               0-0       0-1       0-2
Tested Against:          1-0       ***       1-2
                         2-0       2-1       ***
                         3-0       3-1       ***
                         ***       4-1       ***

There were no false negatives or false positives on the data listed in the tables. However, for the other frames the results were:
False Negatives: the program never missed detecting an activity; it was only sometimes wrong about which activity was occurring.
False Positives: measured over 4 images per camera per reference frame, i.e., 4*3 = 12 comparisons per pose:
Crouch: over all data: 6/12; for applicable data: 0/3
Y (side flap): over all data: 3/12; for applicable data: 0/7
FlapFront: over all data: 6/12; for applicable data: 0/6
FlapSide (front flap, side view): over all data: 5/12; for applicable data: 0/7
Note: these false positives were the result of bad data; once data from non-optimal camera angles for certain motions is discounted for specific cameras, along with frames that captured motion the program was not designed to recognize, none remain. Update: to translate for graders who didn't understand, the tables show that for every picture I deemed usable, that is, not corrupt or misleading in some way, the program correctly identified the motion taking place.

Program Limitations, Assumptions, and Difficulties:
The first problem I encountered when using the image-difference technique of subtracting the pixel values of one image from another was that there was a lot of background noise. I first ran my program and then analyzed the output in gimp. Using gimp I was able to determine that the noise occurred when pixel values were between 230 and 254, so I set any pixel that was off-white to be exactly white. This got rid of the noise, so that the output image consisted of only the changed pixels, with the rest white. I also checked whether each pixel's value was less than an error threshold I had set; if it was, I set it equal to the highest value of image1. Combined with the threshold for off-white pixels, this eliminated all the noise.
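A sketch of that cleanup pass is below. The 230 to 254 off-white band is what was measured in gimp; the exact low-end cutoff isn't recorded above, so ERROR_THRESH here is an assumed placeholder.

#define WHITE        255
#define OFFWHITE_LO  230   /* measured in gimp: noise lies in 230-254 */
#define ERROR_THRESH 10    /* assumed placeholder for the low-end cutoff */

/* Clean residual noise from the difference image.  max1 is the
 * highest pixel value found in image1. */
void clean_noise(unsigned char *img, int npixels, unsigned char max1)
{
    for (int i = 0; i < npixels; i++) {
        if (img[i] >= OFFWHITE_LO && img[i] < WHITE)
            img[i] = WHITE;   /* snap off-white noise to pure white */
        else if (img[i] < ERROR_THRESH)
            img[i] = max1;    /* clamp near-black noise to image1's max */
    }
}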

Camera 1, the camera off to the side, was problematic for certain poses. For crouching it produced too much change in the output image, so all crouching data from camera1 should be discounted: in the program it did not fit well within the boundaries of the other poses. Since the front views from camera0 and camera2 work well for crouching, we can assume the third camera is redundant.

Camera1 also had difficulties with both of the front-flap poses. This, however, was accounted for by the third input variable, which indicates whether the data is from camera1 or not. The camera1 data behaved predictably, so even though camera1 is non-optimal for these poses, the program can still determine what is going on.

Assumptions:
This program only works for pictures taken against a fairly uniform background, with the cameras set up in the same positions and the files named accordingly. It is also unknown whether this program would work for other people of various heights and weights or wearing different-colored clothing.

In addition, the original intent of using ratios of different area values was abandoned given the limited data. Instead, the values used were general patterns in the size of the resulting change-of-area image for certain poses. For example, the data showed that images of the Y-shaped pose had area values ranging from 28000 to 45000; similar ranges were used for each pose to determine what was going on in each picture given its area. With more data and time, I think one would find a correlation between each activity and the ratios of their respective areas; lacking the ratios, actual area minimums and maximums had to be used instead.
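As a sketch, the classification then reduces to simple range checks like these; only the Y-pose range (28000 to 45000) comes from the data above, and the other ranges are placeholders:

/* Classify a pose from the bounding-box area of the change image.
 * Only the Y-pose range is from measured data; the other ranges are
 * illustrative placeholders. */
const char *classify_pose(int area)
{
    if (area >= 28000 && area <= 45000)
        return "side flap (Y pose)";      /* measured: 28000-45000 */
    if (area > 0 && area < 28000)
        return "crouch";                  /* placeholder: smaller change */
    if (area > 45000)
        return "front flap";              /* placeholder: larger change */
    return "no motion detected";
}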
