arXiv:1608.08395v1 [cs.CV] 30 Aug 2016

Motion Representation with Acceleration Images

Hirokatsu Kataoka, Yun He, Soma Shirakabe, Yutaka Satoh
National Institute of Advanced Industrial Science and Technology (AIST)
Tsukuba, Ibaraki, Japan
{hirokatsu.kataoka, yun.he, shirakabe-s, yu.satou}@aist.go.jp

Abstract

Information from time differentiation is an extremely important cue for motion representation. We have applied first-order differentiation of positional information to obtain velocity; moreover, we believe that the second-order differential, acceleration, is also a significant feature for motion representation. However, an acceleration image computed from a typical optical flow contains motion noise, and acceleration images have not been employed because the noise is too strong to capture an effective motion feature from an image sequence. On the other hand, recent convolutional neural networks (CNN) are robust against input noise. In this paper, we employ an acceleration stream in addition to the spatial and temporal streams of the two-stream CNN. We clearly show the effectiveness of adding the acceleration stream to the two-stream CNN.

1 Introduction

Highly discriminative motion representation is needed in the fields of action recognition, event recognition, and video understanding. Space-time interest points (STIP), which capture temporal keypoints, were a major step toward visual motion representation. An improvement over STIP is the so-called dense trajectories (DT) proposed by Wang et al. [20]. The core idea of DT is denser sampling and a richer set of descriptors than STIP. In 2013, DT was improved by three techniques, namely camera motion estimation with SURF, Fisher vector representation, and detection-based noise canceling [21]. The powerful DT or improved DT (DT/IDT) framework has been cited in numerous papers as of 2016. However, the success of convolutional neural networks (CNN) in image-based recognition cannot be ignored. We project motion information into images in order to apply a CNN architecture to motion representation. The two-stream CNN is a noteworthy algorithm for capturing temporal features in an image sequence [18]. The integration of spatial and temporal streams effectively enhances motion representation, since the spatial information provides knowledge that helps the temporal feature. The strongest approach to date lies at the crosspoint of IDT and the two-stream CNN: trajectory-pooled deep-convolutional descriptors (TDD) [22] have achieved the highest performance on several benchmarks, such as UCF101 [19] (91.5%) and HMDB51 (65.9%) [11]. A more recent result was demonstrated in the ActivityNet challenge held in conjunction with CVPR 2016, where a TDD-based approach achieved a remarkable 93.2% mAP (94.2% on UCF101 and 69.4% on HMDB51).

However, current approaches rely heavily on the two-stream architecture. To improve motion-based features, we employ an acceleration stream for a richer image representation. In physics, acceleration is the rate of change of speed with respect to time; acceleration images should therefore extract a more precise feature from an image sequence. In this paper, we propose the simple technique of using "acceleration images" to represent the change of a flow image. The acceleration images should be informative because their representation differs from that of the position (RGB) and speed (flow) images.
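To make the physics analogy explicit, the standard relation between position x(t), velocity (speed) v(t), and acceleration a(t) is written below; this is general background, not a formula from this paper:

v(t) = \frac{dx(t)}{dt}, \qquad a(t) = \frac{dv(t)}{dt} = \frac{d^{2}x(t)}{dt^{2}}

The flow image I′ plays the role of v and the acceleration image I′′ plays the role of a; Section 3 computes them with finite differences.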
We apply the two-stream CNN [18] as the baseline and add an acceleration stream to the spatial and temporal streams. The acceleration images are generated by differential calculations on a sequence of flow images. Although this sparse representation tends to be noisy (see Figure 1), automatic feature learning with a CNN can pick out the necessary features in the acceleration images. We carry out experiments on traffic data in the NTSEL dataset [7].

2 Related work

Space-time interest points (STIP) have been a primary focus in action recognition [13]. In STIP, the time axis t is added to the x, y spatial domain. Improvements of STIP have been reported in several papers, such as [14], [15], [3]. However, the most significant approach is arguably dense trajectories (DT) [20]. DT describes trajectories that track densely sampled feature points. Descriptors are computed along the densely captured trajectories with histograms of oriented gradients (HOG) [1], histograms of optical flow (HOF) [14], and motion boundary histograms (MBH) [2].

Figure 1: Image representation of RGB (I), flow (I′), and acceleration (I′′).

Dense sampling approaches for activity recognition were also proposed in [6, 8, 21] after the introduction of the first DT. These studies extended DT, for example, by eliminating extraneous flow [6] and by integrating a higher-order descriptor into the conventional features for fine-grained action recognition [8]. Additionally, Wang et al. proposed IDT [21] by executing camera motion estimation, canceling detection-based noise, and adding a Fisher vector [16]. More recent work has reported state-of-the-art performance achieved with the concatenation of CNN features and IDT in the THUMOS Challenge [4, 5, 25]. Jain et al. employed per-frame CNN features from layers 6, 7, and 8 of AlexNet [10]. Zhu et al. [25] extended both representations with multi-scale temporal sampling in the IDT [12] and video representations in the CNN feature [24]. The combination of IDT and CNN synergistically improves recognition performance.

Recently, CNN features with temporal representations have been proposed [18, 17, 22]. Ryoo et al. clearly bested IDT+CNN with their pooled time series (PoT), which continuously accumulates frame differences between pairs of frames [17]. The feature is simple but effective for grasping continuous action sequences. However, a feature that better fits transitional action recognition is still needed: short-term prediction is difficult with PoT because it describes features over a whole image sequence. Kataoka et al. proposed a subtle motion descriptor (SMD) to represent sensitive motion in spatio-temporal human actions [9]; the SMD enhances a zero-around temporal pooled feature. The two-stream CNN is a well-organized algorithm that captures the temporal feature in an image sequence [18]. The integration of the spatial and temporal streams effectively enhances the motion representation, since the spatial information helps the temporal feature. Moreover, the strongest approach to date lies at the crosspoint of IDT and the two-stream CNN: TDDs have achieved the highest performance on several benchmarks, such as UCF101 (91.5%) and HMDB51 (65.9%) [22].

3 Acceleration images in the two-stream CNN

Acceleration images. The relationship among RGB, flow, and acceleration images is shown in Figure 1. The acceleration image I′′ is a second-order differential of a position image I, i.e., the RGB input. The acceleration image I′′ is computed as

I′′_x = I′(i + 1, j) − I′(i, j)   (1)
I′′_y = I′(i, j + 1) − I′(i, j)   (2)

where i and j index the x and y coordinates. I′ denotes a flow image computed from the optical flow displacement d [18]:

I′_x = d_x(u, v)   (3)
I′_y = d_y(u, v)   (4)

where (u, v) is an arbitrary point. The acceleration and flow images are stacked over 10 frames as (I′′_x1, I′′_y1, I′′_x2, I′′_y2, ..., I′′_x10, I′′_y10) and (I′_x1, I′_y1, I′_x2, I′_y2, ..., I′_x10, I′_y10), respectively, to form 20-channel inputs [18].
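As a concrete illustration, the following is a minimal NumPy/OpenCV sketch of Eqs. (1)-(4) and the 10-frame stacking. It is not the authors' released implementation: Farneback flow stands in for the flow method of [18], and the choice of which flow component each difference is taken from is our assumption.

```python
# Minimal sketch (not the authors' code) of Eqs. (1)-(4) and the 10-frame stacking.
import cv2
import numpy as np

def flow_image(prev_gray, next_gray):
    """I' : optical-flow displacement d = (d_x, d_y), Eqs. (3)-(4)."""
    d = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                     0.5, 3, 15, 3, 5, 1.2, 0)  # H x W x 2
    return d[..., 0], d[..., 1]                                 # I'_x, I'_y

def acceleration_image(flow_x, flow_y):
    """I'' : finite differences of the flow image, Eqs. (1)-(2)."""
    acc_x = np.zeros_like(flow_x)
    acc_y = np.zeros_like(flow_y)
    acc_x[:, :-1] = flow_x[:, 1:] - flow_x[:, :-1]  # I'(i+1, j) - I'(i, j)
    acc_y[:-1, :] = flow_y[1:, :] - flow_y[:-1, :]  # I'(i, j+1) - I'(i, j)
    return acc_x, acc_y

def stack_streams(gray_frames):
    """Stack 10 consecutive flow / acceleration images into 20-channel inputs."""
    flow_stack, acc_stack = [], []
    for prev_f, next_f in zip(gray_frames[:10], gray_frames[1:11]):
        fx, fy = flow_image(prev_f, next_f)
        ax, ay = acceleration_image(fx, fy)
        flow_stack += [fx, fy]
        acc_stack += [ax, ay]
    return np.stack(flow_stack), np.stack(acc_stack)  # each (20, H, W)
```

If the intended operation is instead a frame-to-frame difference over "a sequence of flow images", acceleration_image can be replaced by a simple subtraction of consecutive flow frames; the stacking step is unchanged.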
We implement VGGNet using the code provided by Limin Wang [23]. We integrate the acceleration stream into the two-stream CNN, in addition to the spatial and temporal streams, as follows:

f = f_spa + α f_tem + β f_acc   (5)

where f indicates the softmax score, and spa, tem, and acc correspond to the spatial, temporal, and acceleration streams, respectively. α (= 2.0) and β (= 2.0) are weighting parameters.
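A minimal sketch of the late fusion in Eq. (5) is given below, assuming f_spa, f_tem, and f_acc are per-class softmax score vectors; the exact fusion code is not given in the paper, and the score values in the usage example are purely illustrative.

```python
# Minimal sketch (assumption, not the released code): late fusion of the three
# stream outputs following Eq. (5), with alpha = beta = 2.0.
import numpy as np

def fuse_streams(p_spatial, p_temporal, p_accel, alpha=2.0, beta=2.0):
    """Weighted sum of per-class softmax scores from the three streams."""
    f = p_spatial + alpha * p_temporal + beta * p_accel
    return int(np.argmax(f))  # predicted action label

# Illustrative scores for the 4 NTSEL classes (walking, turning, crossing, bicycle):
p_spa = np.array([0.60, 0.20, 0.15, 0.05])
p_tem = np.array([0.30, 0.40, 0.20, 0.10])
p_acc = np.array([0.25, 0.45, 0.20, 0.10])
print(fuse_streams(p_spa, p_tem, p_acc))  # -> 1 (second class)
```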
Figure 2: NTSEL dataset.

CNN training. The learning procedure for the spatial and temporal streams follows [23]. We employ the temporal net as a pre-trained model for the acceleration stream, because the 20-channel input and image values are very similar. The initial learning rate is set to 0.001 and is multiplied by 0.1 every 10,000 iterations; training of the acceleration stream terminates at 50,000 iterations. We assign a high dropout ratio in all fully connected (fc) layers, setting 0.9 for both the first and second fc layers of the acceleration stream.
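The training recipe above can be sketched as follows. This is written in PyTorch style for illustration only; the original setup follows the Caffe-based code of [23], and the SGD momentum and layer sizes here are placeholders not stated in the paper.

```python
# Illustrative sketch only: step learning-rate schedule and fc dropout
# described for the acceleration stream (not the authors' Caffe configuration).
import torch.nn as nn
import torch.optim as optim

def make_optimizer(model):
    # lr starts at 0.001 and is multiplied by 0.1 every 10,000 iterations;
    # training stops at 50,000 iterations. Momentum 0.9 is an assumed default.
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.1)
    return optimizer, scheduler  # call scheduler.step() once per iteration

# Dropout ratio 0.9 on both fully connected layers of the acceleration stream
# (feature dimensions here are placeholders).
fc_head = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.9),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.9),
)
```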
4 Experiment

NTSEL dataset (NTSEL) [7] (Figure 2). The dataset contains near-miss events captured from a vehicle. We focus on a pedestrian's gradual changes between walking straight and turning, which is a fine-grained activity on real roads. The four activities are walking, turning, crossing, and riding a bicycle. The dataset has 100 videos of pedestrian actions; each of the four actions has 25 videos, 15 for training and the other 10 for testing. A difficulty of the dataset is distinguishing walking activities (e.g., walking, turning, crossing) with similar appearances in the image sequence, so primitive motion understanding is beneficial on this dataset.

Results. The results are shown in Table 1. The performance rate is computed per video; the video recognition system outputs one action label for each video.

Approach                     | % on NTSEL
-----------------------------|-----------
Spatial stream               | 87.5
Temporal stream              | 77.5
Acceleration stream          | 82.5
Two streams (S+T) [23]       | 87.5
Three streams (S+T+A; ours)  | 90.0

Table 1: Performance rates of the three-stream architecture (spatial + temporal + acceleration; S+T+A) and other approaches on the NTSEL dataset.

Our proposed algorithm, which adds the acceleration stream, outperforms the two-stream CNN by 2.5% on the NTSEL dataset. The recognition rates of the spatial, temporal, and acceleration streams are 87.5%, 77.5%, and 82.5%, respectively. Surprisingly, the acceleration stream performs better than the temporal stream; it effectively captures acceleration-related movement in the traffic data. We confirmed that the acceleration motion feature of an image sequence improves video recognition. Just as position, speed, and acceleration are fundamental quantities in physics, acceleration is also present and useful in video motion. Moreover, we believe that the CNN automatically selected the dominant features from the acceleration stream.

5 Conclusion

In this paper, we proposed acceleration images, which represent the change of a flow image. The acceleration stream is employed as an additional stream alongside the two-stream CNN. The two-stream training process picks up the necessary features in the acceleration images through automatic feature learning. Surprisingly, motion recognition with the acceleration stream is better than recognition with the temporal stream.

Our future work is to iteratively differentiate the acceleration images to extract more detailed motions. In particular, we hope to capture more refined features of human motion.

References

[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[2] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. European Conference on Computer Vision (ECCV), 2006.
[3] I. Everts, J. C. Gemert, and T. Gevers. Evaluation of color STIPs for human action recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[4] A. Gorban, H. Idrees, Y.-G. Jiang, A. Roshan Zamir, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/, 2015.
[5] M. Jain, J. Gemert, and C. G. M. Snoek. University of Amsterdam at THUMOS Challenge 2014. European Conference on Computer Vision Workshop (ECCVW), 2014.
[6] M. Jain, H. Jegou, and P. Bouthemy. Better exploiting motion for better action recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[7] H. Kataoka, Y. Aoki, Y. Satoh, S. Oikawa, and Y. Matsui. Fine-grained walking activity recognition via driving recorder dataset. IEEE Intelligent Transportation Systems Conference (ITSC), 2015.
[8] H. Kataoka, K. Hashimoto, K. Iwata, Y. Satoh, N. Navab, S. Ilic, and Y. Aoki. Extended co-occurrence HOG with dense trajectories for fine-grained activity recognition. Asian Conference on Computer Vision (ACCV), 2014.
[9] H. Kataoka, Y. Miyashita, M. Hayashi, K. Iwata, and Y. Satoh. Recognition of transitional action for short-term action prediction using discriminative temporal CNN feature. British Machine Vision Conference (BMVC), 2016.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems (NIPS), 2012.
[11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. International Conference on Computer Vision (ICCV), 2011.
[12] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[13] I. Laptev. On space-time interest points. International Journal of Computer Vision (IJCV), 2005.
[14] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[15] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[16] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. European Conference on Computer Vision (ECCV), 2010.
[17] M. S. Ryoo, B. Rothrock, and L. Matthies. Pooled motion features for first-person videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[18] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition. Neural Information Processing Systems (NIPS), 2014.
[19] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.
[20] H. Wang, A. Klaser, C. Schmid, and L. Cheng-Lin. Action recognition by dense trajectories. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[21] H. Wang and C. Schmid. Action recognition with improved trajectories. International Conference on Computer Vision (ICCV), 2013.
[22] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[23] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159, 2015.
[24] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[25] L. Zhu, Y. Yang, and A. G. Hauptmann. UTS-CMU at THUMOS 2015. CVPR 2015 International Workshop and Competition on Action Recognition with a Large Number of Classes, 2015.