APPEARANCE BASED ROBOT AND HUMAN ACTIVITY RECOGNITION SYSTEM

Bappaditya Mandal
Email: bmandal@i2r.a-star.edu.sg
Institute for Infocomm Research, A*STAR, Singapore

ABSTRACT

In this work, we present an appearance based human activity recognition system. It uses background modeling to segment the foreground object and extracts useful discriminative features for representing activities performed by humans and robots. A subspace based method, principal component analysis, is used to extract low dimensional features from large, voluminous activity images. These low dimensional features are then used to classify an activity. An apparatus is designed using a webcam, which watches a robot replicating a human fall in an indoor environment. In this apparatus, a robot performs various activities (like walking, bending and moving arms) replicating humans, which also include a sudden fall. Experimental results on the robot performing various activities and on standard human activity recognition databases show the efficacy of our proposed method.

Index Terms— Activity recognition system, feature extraction, human fall detection, subspace methods.

1. INTRODUCTION

Automatic human activity recognition from video is an important problem that plays critical roles in many domains, such as health-care environments, surveillance, athletics and human-computer interaction [1, 2, 3, 4, 5]. Developing algorithms to recognize human activities has proven to be an immense challenge, since the problem combines the uncertainty associated with computational vision with the added whimsy of human behavior. One of the fundamental challenges of recognizing activities is accounting for the variability that arises when cameras capture humans performing arbitrary actions [6, 7]. Popular survey papers on human activity recognition and its challenges can be found in [8, 9, 10].

In this work, an appearance based automatic activity recognition system is presented. The system captures images (videos) and recognizes the activities performed by humans/robots. Presently, it has been tested and evaluated on a pilot robot performing various activities, with frontal and side poses, in an open environment. Our algorithm is also tested on a publicly available human activity database. Our framework has five modules: 1. acquire data (video) from the camera; 2. normalize the images; 3. extract features using subspace methods; 4. match the features against the templates stored in the database; 5. output an activity recognition ID. We preprocess the images (incoming videos) using the normalization technique described in [11, 12, 13] and then apply the popular statistical pattern recognition method principal component analysis (PCA) [14] to extract useful discriminating features for recognizing various activities.
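For illustration, the following is a minimal Python sketch of the five-module pipeline just described. The function names, array shapes, the random stand-in frames and the toy template database are assumptions made only for exposition; they are not the deployed implementation.

```python
import numpy as np

def acquire_frames(num_frames=3, h=240, w=320):
    """Module 1: stand-in for webcam acquisition (random frames here)."""
    return np.random.rand(num_frames, h, w)

def normalize(frame):
    """Module 2: flatten and normalise a frame to zero mean, unit variance."""
    f = frame.astype(np.float64).ravel()
    return (f - f.mean()) / (f.std() + 1e-8)

def extract_features(x, basis):
    """Module 3: project a normalised image vector onto a subspace basis."""
    return basis.T @ x

def match(y, templates, labels):
    """Module 4: nearest stored template under the Euclidean distance."""
    dists = np.linalg.norm(templates - y, axis=1)
    return labels[int(np.argmin(dists))]

if __name__ == "__main__":
    # Module 5: output an activity recognition ID for each incoming frame.
    rng = np.random.default_rng(0)
    n, d = 240 * 320, 50
    basis = rng.standard_normal((n, d))        # placeholder projection vectors
    templates = rng.standard_normal((4, d))    # placeholder stored templates
    labels = ["arms moving", "bending", "falling", "walking"]
    for frame in acquire_frames():
        y = extract_features(normalize(frame), basis)
        print("recognition ID:", match(y, templates, labels))
```

In the actual system the basis and templates come from the training stage described in Section 2, and the normalisation operates on a cropped region rather than the full frame.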
2. DESCRIPTION OF THE SYSTEM

2.1. Apparatus Design

Fig. 1 shows the design of our apparatus. We have tried to replicate a scenario in which a human being (in our case the robot) walks inside a room and performs some daily activities, including a sudden fall. Presently, our apparatus is made of plywood and is open on one side so that the robot can walk into the room normally. The top of the apparatus is also open, which leads to an uncontrolled lighting environment. The background is an indoor environment with some static objects, such as a sofa, a table and a lamp. The whole room is watched by a mounted webcam, as shown in Fig. 1. This webcam is connected (wired or wireless) to a PC. It captures data continuously and transmits it to the PC. The captured video is view independent, since the robot performs various activities in arbitrary directions within the view of the camera. Inside the PC, our modules process these videos and output the activity state of the robot. In our case, the camera is static and can capture video during day and night.

Fig. 1. The proposed robot activity recognition apparatus. It has an indoor environment; the robot walks freely inside this environment and is watched by a mounted webcam.

2.2. System Overview

Fig. 2. Overview of our proposed system, which can detect and recognize four activities performed by a robot: arms moving, bending, falling and walking.

Fig. 2 shows the system overview. Our first module is image data (video) acquisition, which is performed using a simple webcam. The videos are captured and stored as image sequences. Currently our webcam captures images at 10 frames/sec with a frame size of 320×240. The proposed subspace based framework involves two stages:

(i) Training: This is an off-line stage, in which some representative image samples are captured and used for training. During this stage, the machine learns the projection vectors along which the training samples have maximum variance. The output of this stage is a set of basis vectors that captures most of the variance energy of the training samples, together with the projected templates of the training samples (in much reduced dimensions). Both the basis vectors and the templates of the various activities are stored in the database.

(ii) Recognition: This is an online process, in which the system is tested with unseen images (videos). The new image samples are projected onto the projection vectors obtained in (i) and matched against the stored templates in the database. The set of new sample images that matches a stored template most closely is assigned that recognition ID. Presently our system has recognition IDs such as Fall Detection, Walking, Bending and Arms Moving. A similar methodology is followed for frontal and side pose estimation. As soon as the recognition ID is found, the system can, if necessary, generate an alert (SMS/email) to caregivers or nearby authorities. Presently, this system works on a robot performing these activities in arbitrary directions. We plan to extend it to many more activities performed by human beings in indoor as well as outdoor environments.

2.3. Foreground Silhouette Map Generation and Data Normalization

The captured videos are processed automatically to separate the foreground from the background. We have adopted a method established in our previous work on background modeling and subtraction [15, 16, 17]. The captured data are processed and divided into two types: (a) color images of the scene, i.e. the robot walking inside the indoor (room) environment, as shown in Fig. 3 (left), and (b) binary image silhouettes of the robot, as shown in Fig. 3 (middle). Using the silhouette images, we estimate the centroid of the foreground (the robot). a1 and a2 are the two extreme points of the foreground object (robot) in the x-axis direction. Similarly, b1 and b2 are the two extreme points in the y-axis direction. The centroid is obtained as the mean of these points. Fig. 3 (middle) shows the estimation of the points used for calculating the centroid.

Fig. 3. Left: original incoming image; Middle: silhouette image, with the centroid calculated as (a, b) = ((a1 + a2)/2, (b1 + b2)/2); Right: normalized image. (Best viewed in color)
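As a concrete illustration of this centroid computation, the short Python sketch below recovers (a, b) from a binary silhouette mask. The indexing convention (x as column index, y as row index) and the synthetic mask are assumptions for illustration only.

```python
import numpy as np

def silhouette_centroid(mask):
    """Centroid of the foreground from the extreme silhouette points.

    mask: 2-D boolean array, True where the silhouette (robot) is.
    Returns (a, b) = ((a1 + a2) / 2, (b1 + b2) / 2), following Fig. 3.
    """
    ys, xs = np.nonzero(mask)
    a1, a2 = xs.min(), xs.max()   # extreme points along the x-axis
    b1, b2 = ys.min(), ys.max()   # extreme points along the y-axis
    return (a1 + a2) / 2.0, (b1 + b2) / 2.0

# Example with a small synthetic silhouette.
mask = np.zeros((240, 320), dtype=bool)
mask[60:200, 140:180] = True
print(silhouette_centroid(mask))   # -> (159.5, 129.5)
```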
Using this centroid point, we crop an image of size 140 × 130 from the original color image. This cropped color image is then normalized by (i) converting it into a gray scale image, (ii) performing histogram equalization to smooth the distribution of grey values over all the pixels (a sample preprocessed image is shown in Fig. 3 (right)), and (iii) normalizing the image so that its pixels have zero mean and unit standard deviation.

2.4. Activity Training Using Subspace Based Method

In this work, an action denotes a short sequence of body configurations (arm still, body bending). It is usually, but not exclusively, defined by one or a few body parts [6, 18]. An activity denotes a sequence of body configurations over a longer span of time. Activities can be assembled from one or more actions, and actions can specify details of an activity (e.g. falling with arms raised, trying to hold a support).

Let the normalized images be of size w-by-h. We can form a training set of column vectors {X_ij}, where X_ij ∈ R^n, n = wh, is called an image vector, obtained by lexicographic ordering of the pixel elements of image j of activity i. Let the training set contain p activities and q_i sample images for activity i. The total number of training samples is l = \sum_{i=1}^{p} q_i. For activity recognition, each activity is a class with prior probability c_i. The total (mixture) scatter matrix S_t is defined by

S_t = \sum_{i=1}^{p} \frac{c_i}{q_i} \sum_{j=1}^{q_i} (X_{ij} - \bar{X})(X_{ij} - \bar{X})^T,   (1)

where \bar{X} = \sum_{i=1}^{p} \frac{c_i}{q_i} \sum_{j=1}^{q_i} X_{ij}. If all classes have equal prior probability, then c_i = 1/p.

If we regard the elements of the image vector or the class mean vector as features, these preliminary features are de-correlated by solving the eigenvalue problem [19]

\Lambda_t = \Phi_t^T S_t \Phi_t,   (2)

where \Phi_t = [\phi_1^t, ..., \phi_n^t] is the eigenvector matrix of S_t, and \Lambda_t is the diagonal matrix of eigenvalues \lambda_1^t, ..., \lambda_n^t corresponding to the eigenvectors. We assume that the eigenvectors are sorted according to the eigenvalues in descending order, \lambda_1^t ≥ ... ≥ \lambda_n^t. We perform dimensionality reduction by selecting the principal projections/directions of the data with the larger variances.

The first advantage is that we can represent each activity by low dimensional discriminative features [20, 4, 2]. Secondly, the model parameters can be computed directly from the training data, for example by diagonalizing the sample covariance matrix, so the system does not have any free parameter. This approach is less sensitive to the training data, to the number of samples per activity and to noise present in the data.

2.5. Feature Extraction and Activity Recognition

After solving the eigenvalue problem in (2), dimensionality reduction is performed by keeping the eigenvectors with the d largest eigenvalues, \Phi_d^t = [\phi_k^t]_{k=1}^{d} = [\phi_1^t, ..., \phi_d^t], where d is the number of features, usually selected for a specific application. A set of projected features Y ∈ R^d in the subspace can be obtained for any image X by representing the samples with the new feature vector Y = (\Phi_d^t)^T X. At the recognition stage, each n-D image vector X is transformed into a d-D feature vector Y using the extraction matrix \Phi_d^t obtained in the training stage. Finally, a classifier trained on the gallery set is applied to recognize the probe feature vectors.
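A compact Python sketch of this training and recognition procedure is given below, under the simplifying assumption of equal class priors, so the mixture scatter in (1) reduces to the ordinary sample scatter; its leading eigenvectors are obtained here from an economy-size SVD rather than by forming the n × n matrix explicitly. The random stand-in data and the label layout are illustrative only.

```python
import numpy as np

def train_pca_subspace(X, d):
    """Learn the d leading eigenvectors of the total scatter of X.

    X: l x n matrix with one normalised image vector per row (equal class
    priors assumed). Returns Phi_d of shape n x d, columns sorted by
    decreasing eigenvalue.
    """
    Xc = X - X.mean(axis=0)                      # centre for the scatter
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T

def project(x, Phi_d):
    """Y = Phi_d^T X, the d-dimensional feature vector of Section 2.5.
    (Subtracting the mean first would only shift every feature by the
    same constant, leaving Euclidean distances unchanged.)"""
    return Phi_d.T @ x

def nearest_neighbour(y, gallery_feats, gallery_labels):
    """1-NNK classification with the Euclidean distance."""
    dists = np.linalg.norm(gallery_feats - y, axis=1)
    return gallery_labels[int(np.argmin(dists))]

# Toy usage with random stand-ins for the 140 x 130 normalised crops.
rng = np.random.default_rng(1)
l, n, d = 60, 140 * 130, 20
X_train = rng.standard_normal((l, n))
labels = np.repeat(["arms moving", "bending", "falling", "walking"], 15)
Phi_d = train_pca_subspace(X_train, d)
gallery = X_train @ Phi_d                        # projected templates, l x d
probe = rng.standard_normal(n)
print(nearest_neighbour(project(probe, Phi_d), gallery, labels))
```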
Human/robot (depending on the timer) activities have an inherently varying spatio-temporal structure. They vary when performed by different persons, and even the same performer is never able to reproduce a movement exactly. So, to compare two activities of different lengths, we use dynamic time warping (DTW) [21], which performs time alignment and normalization by computing a temporal transformation allowing two activities to be matched. An illustrative example with diagrams is given in [21]. In all the experiments of this work, a simple first nearest neighbor classifier (1-NNK) is applied, with the Euclidean distance used to measure the distance between a probe feature vector and a gallery feature vector.
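The following is a textbook DTW sketch in Python for comparing two feature sequences of different lengths. It only illustrates the alignment idea from [21] and is not the specific formulation used there; the random sequences are placeholders.

```python
import numpy as np

def dtw_distance(A, B):
    """Dynamic time warping cost between two activity sequences.

    A, B: arrays of per-frame feature vectors, possibly of different
    lengths. Returns the accumulated cost of the best time alignment.
    """
    la, lb = len(A), len(B)
    D = np.full((la + 1, lb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = np.linalg.norm(np.asarray(A[i - 1]) - np.asarray(B[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[la, lb]

# Two PCA feature sequences of different lengths can still be compared.
rng = np.random.default_rng(2)
seq1 = rng.standard_normal((30, 20))   # 30 frames, 20-D features
seq2 = rng.standard_normal((45, 20))   # 45 frames, 20-D features
print(dtw_distance(seq1, seq2))
```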
3. EXPERIMENTAL RESULTS

In this work we evaluate our proposed methodology on 3 datasets: (a) a robot performing 4 activities, (b) estimation of robot pose for frontal and side views, and (c) the Weizmann dataset [22, 23], containing several actors performing 10 actions. Datasets (a) and (b) are created by us and (c) is publicly available. We preprocess all the images following the normalization procedure described in Section 2.3. Each dataset is partitioned into training and testing sets, with no overlap between them. Moreover, in the Weizmann dataset, there is no overlap in the actors performing the various activities in the training and testing sets.

3.1. Results on Real Time Robot Activities

In our first experiment, we evaluate the proposed approach on a randomly picked sample video clip, captured independently and at different times, of the pilot robot performing various activities: arms moving, bending, falling and walking. The training and testing sessions are separated by an interval of 2 months. For training, a video clip of about 62 seconds (at 10 frames/sec) is captured and manually divided into the four activities. Fig. 4 shows some sample robot images for two activities, arms moving and falling, together with their normalized images. Using our webcam, an independent testing video clip of 500 seconds is obtained. The clip is captured at 10 frames/second, so 5000 testing images are obtained.

Fig. 4. Sample images from our robot database for two activities (arms moving and falling) and their normalized images. Four sample images per activity. (Best viewed in color)

We created the ground truth for the testing sequences based on human knowledge of the various activities and measured it against the activities recognized by our proposed system. Probe images captured in real time by our webcam at 10 frames/sec are sampled over 1 sec intervals and then our proposed methodology is applied. Table 1 shows the confusion matrix of recognition rates (%) for the 5000 testing images (500 seconds of video). It is evident from the table that our system recognizes falling (an important aspect in health-care environments) and bending very accurately. However, for arms moving and walking, our system performs poorly. This is probably because our system presently does not employ sophisticated spatiotemporal information; it uses only DTW to represent each class. Nevertheless, the system presently works in real time and detects very accurately the bending pose and falling, a scenario replicating the commonly encountered fall of an elderly person. Moreover, the robot bends many times while performing arms moving and walking. In this setup, we use only the first nearest neighbor classifier (1-NNK). We anticipate that 2-NNK (first and second nearest neighbors) would improve the results [24], e.g. in the bending-walking or bending-arms moving scenarios.

Table 1. Confusion matrix of recognition rates (%) using 5000 samples (video clip of 500 seconds) for four activities using 150 features. Rows represent the probe classes and columns represent the gallery classes.

Activities     Arms Moving   Bending   Falling   Walking
Arms Moving    8.04          91.07     0         0.89
Bending        0             99.47     0         0.53
Falling        1.96          0         93.14     4.90
Walking        0             86.60     0         13.40

3.2. Results on Real Time Robot Poses

In the second experiment, for training, we use 32 images divided equally between frontal and side poses.
For testing, we use the same test sequence (of 5000 images) as in the previous section, with the same experimental setup. However, this time we try to estimate the frontal and side poses of the robot while it performs the various activities. We perform this experiment to evaluate our algorithm for pose estimation; we anticipate that such estimation would help in future algorithm development for real-time systems. Table 2 shows the confusion matrix of recognition rates (%) for the 5000 testing images (500 seconds of video). From Table 2 it is evident that our proposed algorithm estimates the side pose better than the frontal pose. More classes, like half left profile and half right profile, would further help in obtaining more accurate pose estimates. We intend to develop these in our future work.

Table 2. Confusion matrix of recognition rates (%) using 5000 samples (video clip of 500 seconds) for frontal and side poses using 30 features. Rows represent the probe classes and columns represent the gallery classes.

Poses     Frontal   Side
Frontal   57.93     42.07
Side      9.17      90.83

3.3. Results on Weizmann dataset

We have also evaluated the proposed approach on the publicly available Weizmann dataset [22, 23]. It contains 10 actions: bend (bend), jumping-jack (jack), jump-in-place (pjump), jump-forward (jump), run (run), gallop-sideways (side), jump-forward-one-leg (skip), walk (walk), wave one hand (wave1) and wave two hands (wave2), performed by 9 actors. Each video sequence ranges from 30 to 120 frames. Silhouettes extracted from the backgrounds and the original image sequences are provided. We divide this database into training and testing sets: 4 actors performing the 10 activities are used for training, while the remaining 5 actors performing the 10 activities are used for testing. We run the experiments from 20 features to 200 features in steps of 20 features. The results become stable at 80 features, after which no improvement in recognition performance is obtained. Table 3 shows the confusion matrix of activity recognition rates between the gallery and testing samples, averaged over 5 performers for the 10 activities.

Table 3. Confusion matrix of recognition rates (%) using our approach for 10 activities: (1) bend, (2) jack, (3) jump, (4) pjump, (5) run, (6) side, (7) skip, (8) walk, (9) wave1 and (10) wave2, averaged over testing samples of 5 performers.

Activities   1      2      3      4      5      6      7      8      9      10
1            96.67  2.00   1.33   0      0      0      0      0      0      0
2            2.33   97.67  0      0      0      0      0      0      0      0
3            0      0      96.67  3.33   0      0      0      0      0      0
4            0      0      2.67   92.33  0      0      3.33   1.67   0      0
5            0      0      0      0      95.33  0      0      4.67   0      0
6            0      0      0      2.00   0      92.33  5.67   0      0      0
7            0      1.67   0      0      0      3.00   95.33  0      0      0
8            0      0      0      0      3.67   0      0      96.33  0      0
9            0      0      0      0      0      1.00   0      0      95.33  3.67
10           0      0      0      2.00   0      0      0      0      1.33   96.67

It is evident from the table that our system works well for most of the activities except pjump and side, which are confused with skip. This is probably because they require more spatiotemporal information for better discrimination. We plan to incorporate this in our future algorithm developments.

3.4. Summary and Conclusions

In this work, we have developed a subspace based activity recognition system. A subspace or appearance based method like PCA is stable in noisy environments and not overly sensitive to the training samples. Moreover, it handles outliers efficiently. The sample covariance matrix is computed from the training data and, after eigen decomposition, the eigenvectors corresponding to the largest eigenvalues are selected. All gallery and test samples are projected onto this reduced subspace and low dimensional discriminative features are extracted. Experimental results on three databases show promising results. Using this methodology, a real-time system is presently working which can recognize a robot performing four activities and estimate frontal and side poses. Moreover, this system also works on a publicly available database, where it can recognize many activities performed by humans.

4. REFERENCES

[1] D. Sánchez, M. Tentori, and J. Favela, "Activity recognition for the smart hospital," IEEE Intelligent Systems, vol. 23, no. 2, pp. 50–57, March 2008.
[2] B. Mandal, X. Jiang, H-L. Eng, and A. Kot, "Prediction of eigenvalues and regularization of eigenfeatures for human face verification," Pattern Recognition Letters, vol. 31, no. 8, pp. 717–724, 2010.
[3] M. Leo, T. D'Orazio, and P. Spagnolo, "Human activity recognition for automatic visual surveillance of wide areas," in ACM 2nd International Workshop on Video Surveillance and Sensor Networks, New York, USA, 2004, pp. 124–130.
[4] B. Mandal, X. D. Jiang, and A. Kot, "Verification of human faces using predicted eigenvalues," in 19th International Conference on Pattern Recognition (ICPR), Tampa, Florida, USA, Dec 2008, pp. 1–4.
[5] B. Mandal, S. Ching, L. Li, V. Chandrasekha, C. Tan, and J-H. Lim, "A wearable face recognition system on google glass for assisting social interactions," in 3rd International Workshop on Intelligent Mobile and Egocentric Vision, ACCV, Nov 2014, pp. 419–433.
[6] B. Mandal and H-L. Eng, "3-parameter based eigenfeature regularization for human activity recognition," in 35th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, USA, 2010, pp. 954–957.
[7] B. Mandal and H-L. Eng, "Regularized discriminant analysis for holistic human activity recognition," IEEE Intelligent Systems, vol. 27, no. 1, pp. 21–31, 2012.
[8] T. Moeslund, A. Hilton, and V. Kruger, "A survey of advances in vision-based human motion capture and analysis," CVIU, vol. 104, no. 2, pp. 90–126, 2006.
[9] T. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," CVIU, vol. 81, no. 3, pp. 231–268, 2001.
[10] R. Poppe, "A survey on vision-based human action recognition," Image and Vision Computing, vol. 28, no. 6, pp. 976–990, Jun 2010.
[11] R. Beveridge, D. Bolme, M. Teixeira, and B. Draper, "The CSU face identification evaluation system user's guide: Version 5.0," Technical Report: http://www.cs.colostate.edu/evalfacerec/data/normalization.html, 2003.
[12] B. Mandal, X. D. Jiang, and A. Kot, "Multi-scale feature extraction for face recognition," in IEEE International Conference on Industrial Electronics and Applications (ICIEA), May 2006, pp. 1–6.
[13] X. D. Jiang, B. Mandal, and A. Kot, "Face recognition based on discriminant evaluation in the whole space," in IEEE 32nd International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, Apr 2007, pp. 245–248.
[14] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Inc., San Diego, CA, USA, 1990.
[15] H-L. Eng, J. Wang, A. Kam, and W. Yau, "Robust human detection within a highly dynamic aquatic environment in real time," IEEE Trans. Image Processing, vol. 15, no. 6, pp. 1583–1600, 2006.
[16] B. Mandal, H-L. Eng, H. Lu, D. W-S. Chan, and Y-L. Ng, "Non-intrusive head movement analysis of videotaped seizures of epileptic origin," in IEEE International Conference on Engineering in Medicine and Biology Society, 2012, pp. 6060–6063.
[17] H. Lu, Y. Pan, B. Mandal, H-L. Eng, C. Guan, and D. W-S. Chan, "Quantifying limb movements in epileptic seizures through color-based video analysis," IEEE Transactions on Biomedical Engineering, vol. 60, no. 2, pp. 461–469, 2013.
[18] H. Lu, H-L. Eng, B. Mandal, D. W-S. Chan, and Y-L. Ng, "Markerless video analysis for movement quantification in pediatric epilepsy monitoring," in IEEE International Conference on Engineering in Medicine and Biology Society, 2011, pp. 8275–8278.
[19] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, John Wiley & Sons, New York, 2001.
[20] B. Mandal, X. D. Jiang, and A. Kot, "Dimensionality reduction in subspace face recognition," in 6th IEEE International Conference on Information, Communications & Signal Processing (ICICS), Dec 2007, pp. 1–5.
[21] N. V. Boulgouris, K. Plataniotis, and D. Hatzinakos, "Gait recognition using dynamic time warping," in IEEE Workshop on Multimedia Signal Processing, Sept 2004, pp. 263–266.
[22] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in The Tenth IEEE International Conference on Computer Vision (ICCV'05), 2005, pp. 1395–1402.
[23] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247–2253, December 2007.
[24] B. Mandal, W. Zhikai, L. Li, and A. Kassim, "Evaluation of descriptors and distance measures on benchmarks and first-person-view videos for face identification," in International Workshop on Robust Local Descriptors for Computer Vision, ACCV, Nov 2014, pp. 585–599.