The objective of this study is to analyse a dataset of smartphone sensor readings recorded while 30 participants performed everyday activities, draw insights from it, and predict the activity using Machine Learning. We also examine whether participants can be identified from their walking styles and try to draw additional insights. Such a study could be applied to scenarios such as activity detection, monitoring persons for signs of fatigue, and distinguishing one individual from another, with possible deployment in highly sensitive and secure workplaces.
Data Set Source
Jorge-L. Reyes-Ortiz, Luca Oneto, Albert Sama, Xavier Parra, Davide Anguita. Transition-Aware Human Activity Recognition Using Smartphones. Neurocomputing. Springer 2015.
Data Repository : UCI
2. Dynamics Experiment and Data capture (from cited Journal)
Data capture overview
The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. They performed a protocol of activities composed of six basic activities: three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs and walking upstairs). The experiment also included postural transitions that occurred between the static postures. These are: stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie, and lie-to-stand. All the participants were wearing a smartphone (Samsung Galaxy S II) on the waist during the experiment execution. We captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz using the embedded accelerometer and gyroscope of the device. The experiments were video-recorded to label the data manually.
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cut off frequency was used.
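The windowing step described above can be sketched as follows. This is only an illustration on a synthetic signal, not the dataset authors' code; the function name `sliding_windows` and the sine-wave input are assumptions introduced here.

```python
import numpy as np

def sliding_windows(signal, width=128, overlap=0.5):
    """Split a 1-D signal into fixed-width windows with fractional overlap."""
    signal = np.asarray(signal, dtype=float)
    step = int(width * (1 - overlap))            # 64 samples for 50% overlap
    n_windows = (len(signal) - width) // step + 1
    if n_windows <= 0:
        return np.empty((0, width))
    return np.stack([signal[i * step: i * step + width] for i in range(n_windows)])

fs = 50                                          # sampling rate in Hz, as stated in the text
t = np.arange(10 * fs) / fs                      # 10 s of synthetic data (500 samples)
acc_x = np.sin(2 * np.pi * 2 * t)                # hypothetical stand-in for one accelerometer axis

windows = sliding_windows(acc_x)
print(windows.shape)                             # (6, 128)
print(128 / fs)                                  # 2.56 seconds per window
```

With a 50% overlap, the second half of each window is identical to the first half of the next one, so every new window contributes 1.28 sec of fresh signal.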
3. Feature selection
The features selected for this database come from the accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time domain signals (prefix ‘t’ to denote time) were captured at a constant rate of 50 Hz. Then they were filtered using a median filter and a 3rd order low pass Butterworth filter with a corner frequency of 20 Hz to remove noise. Similarly, the acceleration signal was then separated into body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using another low pass Butterworth filter with a corner frequency of 0.3 Hz.
Subsequently, the body linear acceleration and angular velocity were differentiated with respect to time to obtain Jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). The magnitudes of these three-dimensional signals were also calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag).
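As a rough sketch of these two derivations, the jerk can be approximated by a finite difference scaled by the sampling rate, and the magnitude by a per-sample Euclidean norm. The random window below is synthetic, and the exact differentiation scheme used by the dataset authors is not stated in the source, so the first-order difference is an assumption.

```python
import numpy as np

fs = 50.0                                        # sampling rate in Hz, from the text
rng = np.random.default_rng(0)
body_acc = rng.normal(size=(128, 3))             # synthetic window, columns are X, Y, Z

# Jerk: time derivative approximated by a first-order finite difference (assumption)
jerk = np.diff(body_acc, axis=0) * fs            # shape (127, 3)

# Magnitude: Euclidean norm of the three axes at every sample
acc_mag = np.linalg.norm(body_acc, axis=1)       # shape (128,)
jerk_mag = np.linalg.norm(jerk, axis=1)          # shape (127,)

print(jerk.shape, acc_mag.shape, jerk_mag.shape)
```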
Finally a Fast Fourier Transform (FFT) was applied to some of these signals producing fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccJerkMag, fBodyGyroMag, fBodyGyroJerkMag. (Note the ‘f’ to indicate frequency domain signals).
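To illustrate the frequency-domain step, the sketch below applies a real FFT to a single 128-sample window. The 6.25 Hz test tone is a made-up input, chosen only because it falls exactly on an FFT bin; it is not taken from the dataset.

```python
import numpy as np

fs = 50.0                                        # sampling rate in Hz, from the text
n = 128                                          # readings per window, from the text
t = np.arange(n) / fs
window = np.sin(2 * np.pi * 6.25 * t)            # hypothetical 6.25 Hz component

spectrum = np.abs(np.fft.rfft(window))           # magnitudes of the n // 2 + 1 non-negative bins
freqs = np.fft.rfftfreq(n, d=1.0 / fs)           # bin centres from 0 Hz up to 25 Hz (Nyquist)

peak_freq = freqs[np.argmax(spectrum)]           # dominant frequency, about 6.25 Hz here
print(len(spectrum), peak_freq)
```

The index returned by `np.argmax` corresponds to what the feature list calls maxInds(), the index of the frequency component with the largest magnitude.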
These signals were used to estimate variables of the feature vector for each pattern. ‘-XYZ’ is used to denote 3-axial signals in the X, Y and Z directions.
Primary attributes time domain
Primary attributes frequency domain
Statistically derived features
The set of variables that were estimated from these signals are:
1. mean(): Mean value
2. std(): Standard deviation
3. mad(): Median absolute deviation
4. max(): Largest value in array
5. min(): Smallest value in array
6. sma(): Signal magnitude area
7. energy(): Energy measure. Sum of the squares divided by the number of values.
8. iqr(): Interquartile range
9. entropy(): Signal entropy
10. arCoeff(): Autoregression coefficients with Burg order equal to 4
11. correlation(): correlation coefficient between two signals
12. maxInds(): index of the frequency component with largest magnitude
13. meanFreq(): Weighted average of the frequency components to obtain a mean frequency
14. skewness(): skewness of the frequency domain signal
15. kurtosis(): kurtosis of the frequency domain signal
16. bandsEnergy(): Energy of a frequency interval within the 64 bins of the FFT of each window.
17. angle(): Angle between two vectors.
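A few of the simpler statistics in this list can be sketched as below. The helper names `window_features` and `sma` are hypothetical, and the signal magnitude area is taken here as the mean over samples of |x| + |y| + |z|, which is one common definition and an assumption on our part; the dataset authors' normalisation may differ.

```python
import numpy as np

def window_features(x):
    """Compute a handful of the per-window statistics for a 1-D signal."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return {
        "mean": x.mean(),
        "std": x.std(),
        "mad": np.median(np.abs(x - np.median(x))),   # median absolute deviation
        "max": x.max(),
        "min": x.min(),
        "energy": np.sum(x ** 2) / x.size,            # sum of squares / number of values
        "iqr": q75 - q25,                             # interquartile range
    }

def sma(xyz):
    """Signal magnitude area of an (n_samples, 3) array (assumed definition)."""
    return np.abs(np.asarray(xyz, dtype=float)).sum(axis=1).mean()

feats = window_features([1, 2, 3, 4, 5])
print(feats["mean"], feats["mad"], feats["energy"], feats["iqr"])   # 3.0 1.0 11.0 2.0
print(sma([[1, -2, 3], [0, 1, -1]]))                                # 4.0
```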
Additional vectors obtained
Additionally by averaging the signals in a signal window sample the following features were obtained
Activities captured in the data set used
4. Exploratory Data Analysis
We explore the combined train and test data set to see how the data is distributed and how separable the participant and activity labels are.
Distribution of Activities
We check the data volume for each activity to see how balanced the data set is with respect to the different activity labels. We find from the figure below that the volumes are more or less well balanced.
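A balance check of this kind can be sketched with a simple count per label. The label list below is a toy example made up for illustration, not the actual data.

```python
from collections import Counter

# Toy activity labels (hypothetical); in practice this would be the label column
labels = [
    "WALKING", "WALKING", "WALKING",
    "SITTING", "SITTING", "SITTING", "SITTING",
    "STANDING", "STANDING", "STANDING",
    "LAYING", "LAYING", "LAYING", "LAYING",
]

counts = Counter(labels)
total = sum(counts.values())
shares = {activity: n / total for activity, n in counts.items()}

# Ratio of the largest to the smallest class; values close to 1 indicate good balance
imbalance_ratio = max(counts.values()) / min(counts.values())

for activity, n in counts.most_common():
    print(f"{activity:10s} {n:3d} {shares[activity]:.2%}")
print("imbalance ratio:", round(imbalance_ratio, 2))
```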
Separation of Activities using First two Principal Components
Next we evaluate visually whether the classes are separable. As we have multivariate data at hand, we need to reduce it to two dimensions in order to plot it, and we try two different approaches. In the first approach we reduce the dimensionality of the data using the first two principal components. We find from the plot below that PCA, though an established technique of dimensionality reduction, still results in some class overlap if we use just the first two PCs.
Fig. 2 Separation of Activity labels using PCA
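The projection onto the first two principal components can be sketched with a singular value decomposition. The random matrix below is only a stand-in for the real feature matrix, and the study itself may well have used a library PCA routine instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                   # synthetic stand-in for the feature matrix

Xc = X - X.mean(axis=0)                          # centre every feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt[:2].T                           # coordinates on the first two PCs, shape (200, 2)
explained = S ** 2 / np.sum(S ** 2)              # share of variance carried by each component

print(scores.shape)
print("variance explained by two PCs:", explained[:2].sum())
```

The share of variance explained by the first two components gives a rough idea of how much information a two-dimensional plot discards, which is one reason classes may overlap in such a plot.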
Separation of Activities using t-distributed stochastic neighbour embedding
We have seen from Fig. 2 that the PCA method does not give us a very good visual separation in two dimensions. We next try a non-linear dimensionality reduction technique. The t-Distributed Stochastic Neighbour Embedding (t-SNE) is a non-linear technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. It is extensively applied in image processing, NLP, genomic data and speech processing. We find from Fig. 3 a much better separation in this case.
Fig. 3 Separation of Activity labels using t-SNE
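A minimal t-SNE sketch is given below, assuming scikit-learn is available. The two Gaussian clusters are synthetic, and the perplexity and random seed are illustrative choices, not the settings used for Fig. 3.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic, well separated clusters standing in for two activity classes
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 20)),
    rng.normal(6.0, 1.0, size=(50, 20)),
])

# perplexity must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)
print(embedding.shape)                           # (100, 2)
```

The two columns of `embedding` can then be scatter-plotted and coloured by label. Note that t-SNE preserves local neighbourhoods rather than global distances, so the spacing between clusters in the plot should not be over-interpreted.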
Separation of participants using t-distributed stochastic neighbour embedding
We have seen from Fig. 3 that the t-SNE method does give us a good visual separation. Let us now check whether the participants themselves are separable on the basis of their activities. If so, this could be exploited to build prediction models that identify a person from his or her walking style. We find from Fig. 4 that such a possibility exists.
Fig. 4 Separation of Participants using t-SNE
Here we try to fit a range of models to predict the human activity, including ensemble techniques like Bagging, Random Forest and Boosting, as well as deep learning.
We first check the out-of-bag (OOB) error against the number of trees by plotting the OOB error for 10 different tree counts ranging from 50 to 300 in steps of 25 trees. We find from the plot in Fig. 5 that about 200 trees are optimal for the current data set with bagging.
Fig. 5 OOB Error Vs number of trees plot
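The OOB sweep can be sketched as below, assuming scikit-learn. The data is generated synthetically with `make_classification`, so the resulting numbers are unrelated to Fig. 5. We also assume the 10 tree counts come from a half-open range (50, 75, ..., 275), since that yields exactly 10 values; the source does not state this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic 3-class problem standing in for the activity data
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

tree_counts = list(range(50, 300, 25))           # assumed grid: 50, 75, ..., 275
oob_errors = []
for n_trees in tree_counts:
    # The default base learner of BaggingClassifier is a decision tree
    model = BaggingClassifier(n_estimators=n_trees, oob_score=True, random_state=0)
    model.fit(X, y)
    oob_errors.append(1.0 - model.oob_score_)    # OOB error = 1 - OOB accuracy

best_n = tree_counts[int(np.argmin(oob_errors))]
print(list(zip(tree_counts, np.round(oob_errors, 3))))
print("tree count with lowest OOB error:", best_n)
```

Because each tree is trained on a bootstrap sample, the rows left out of that sample act as a built-in validation set, so the OOB error can be estimated without a separate hold-out split.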
Performance at optimal number of trees
We observe from Fig. 5 that about 200 trees would be an optimal tree count. We then run a bagging model with n=200 trees and plot the multiclass confusion matrix as depicted in Fig. 6. We find that the activity ‘standing’ is perfectly classified. However, some misclassification is observed between the activities ‘sitting’ and ‘laying’, which is expected as the two postures are close to each other. We also find a small degree of misclassification among the three forms of walking, which is again expected. We also plot the ROC curves as seen in Fig. 7 and find the same trend of accuracy there.
Fig. 6 Bagging (n=200) Optimal number of trees Confusion Matrix on Test Data
Fig. 7 ROC curves for each activity from BAGGING
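To make the reading of a multiclass confusion matrix concrete, here is a small sketch with made-up labels and predictions for three classes. The numbers are purely illustrative and do not reproduce Fig. 6.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

class_names = ["standing", "sitting", "laying"]  # toy encoding: 0, 1, 2
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]          # hypothetical ground truth
y_pred = [0, 0, 0, 1, 2, 1, 2, 2, 1, 2]          # hypothetical predictions

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()               # correct predictions lie on the diagonal
recall = cm.diagonal() / cm.sum(axis=1)          # per-class share of correctly labelled rows

print(cm)
print("accuracy:", accuracy)                     # 0.8
for name, r in zip(class_names, recall):
    print(name, round(float(r), 3))
```

Off-diagonal entries show which pairs of classes get confused; in this toy case all the errors fall between classes 1 and 2.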
Along similar lines as in bagging, we first check the OOB error against the number of trees by plotting the OOB error for 10 different tree counts ranging from 50 to 300 in steps of 25 trees. We find from the plot in Fig. 8 that about 200 trees are optimal for the current data set with Random Forest.
Fig. 8 OOB Error vs N Trees for random forest
Performance at optimal number of trees
We observe from Fig. 8 that about 200 trees would be an optimal tree count. We then run an RF model with n=200 trees and plot the multiclass confusion matrix as depicted in Fig. 9. We find that the activity ‘standing’ is perfectly classified. However, some misclassification is observed between the activities ‘sitting’ and ‘laying’, which is expected as the two postures are close to each other. We also find a small degree of misclassification among the three forms of walking, which is again expected. We also plot the ROC curves as seen in Fig. 10 and find the same trend of accuracy there. We also observe a marginal improvement in accuracy with RF compared to Bagging.
Fig. 9 Confusion Matrix for Random Forest : n = 200 trees
Fig. 10 ROC Curves from Random Forest with n = 200 estimators
Extra trees classifier
We further try a variant of Random Forest, i.e. the Extra Trees classifier. In this case the samples are drawn randomly without replacement, and the splits are also chosen randomly. Along similar lines as in RF, we first check the OOB error against the number of trees by plotting the OOB error for 10 different tree counts ranging from 50 to 300 in steps of 25 trees. We find from the plot in Fig. 11 that about 225 trees are optimal for the current data set with the Extra Trees classifier.
Fig. 11 OOB Error vs N Trees for Extra Trees classifier
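For readers reproducing this comparison with scikit-learn (an assumption, as the source does not name the library), note that `ExtraTreesClassifier` defaults to `bootstrap=False` and only reports an OOB score when `bootstrap=True` is set. The sketch below uses synthetic data, so its numbers are unrelated to Fig. 11.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic 3-class problem standing in for the activity data
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# OOB scoring needs bootstrap sampling, which Extra Trees does not use by default
et = ExtraTreesClassifier(n_estimators=225, bootstrap=True, oob_score=True,
                          random_state=0)
et.fit(X, y)

rf_oob_error = 1.0 - rf.oob_score_
et_oob_error = 1.0 - et.oob_score_
print("RF OOB error:", round(rf_oob_error, 3))
print("Extra Trees OOB error:", round(et_oob_error, 3))
```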
Performance at optimal number of trees
We observe from Fig. 11 that about 225 trees would be an optimal tree count. We then run an Extra Trees classifier model with n=225 trees and plot the multiclass confusion matrix as depicted in Fig. 12. We find that the activity ‘standing’ is perfectly classified. However, some misclassification is observed between the activities ‘sitting’ and ‘laying’, which is expected as the two postures are close to each other. We also find a small degree of misclassification among the three forms of walking, which is again expected. We also plot the ROC curves as seen in Fig. 12 and find the same trend of accuracy there. We also observe a marginal improvement in accuracy with ETC compared to Random Forest.
Fig. 12 Confusion Matrix : Extra Trees Classifier : n =225 trees
Accuracy Measure comparison
- Bagging with 200 estimators: Accuracy on test set: 90.09%
- RF with 200 estimators: Accuracy on test set: 92.87%
- Extra trees with 225 estimators: Accuracy on test set: 93.65%
6. Additional insights
Feature Importance and Sensor Importance
From our Random Forest classification model we can obtain the importance of every feature. We present the top ten features, which have the most classification power for the data at hand. We also compare the two sensors, i.e. accelerometer and gyroscope, in terms of overall feature importance: we sum the feature importances of all individual features computed from each sensor and plot the totals for a visual comparison. Fig. 13 shows the top ten features with the most classification power and Fig. 14 shows the comparative classification power of the two types of sensors used in the study, i.e. gyroscope and accelerometer.
Fig. 13 Top ten features with most classification power
Fig. 14 Comparative Sensor Importance for classification
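The per-sensor aggregation can be sketched by grouping feature names on the substrings "Acc" and "Gyro", which follow the naming scheme described earlier. The importance values below are made-up numbers for illustration only, and the helper `sensor_of` is hypothetical.

```python
# Hypothetical importances; real values would come from the fitted model
importances = {
    "tGravityAcc-mean()-X": 0.30,
    "tBodyAcc-std()-X": 0.20,
    "fBodyAccJerk-energy()-X": 0.15,
    "tBodyGyro-mean()-Z": 0.25,
    "fBodyGyro-std()-Y": 0.10,
}

def sensor_of(feature_name):
    """Map a feature name to the sensor it was derived from."""
    if "Gyro" in feature_name:
        return "gyroscope"
    if "Acc" in feature_name:
        return "accelerometer"
    return "other"

# Sum the importances of all features belonging to each sensor
sensor_totals = {}
for name, value in importances.items():
    sensor = sensor_of(name)
    sensor_totals[sensor] = sensor_totals.get(sensor, 0.0) + value

# Rank individual features from most to least important
top_features = sorted(importances, key=importances.get, reverse=True)[:3]

print(sensor_totals)
print(top_features)
```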
Participants Walk Analysis 1: Walk Time
In this study we capture the time spent on the activity ‘Walking’ by the different participants and study its distribution to check for anomalies, if any. We find from Fig. 15 that the data is distributed over a range. We assume that the test subjects had a fixed walk distance and that the variations in walk time are natural. We verify this by checking (see Fig. 16) whether the data follows a normal distribution, as in most real-world scenarios. We find that the data is close to a normal distribution barring 2 or 3 outlier points.
Fig. 15 Walk time for 30 participants
Fig. 16 Normality check for the Walk data
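One possible way to estimate walk time from windowed data is sketched below. It assumes that every ‘WALKING’ window contributes 1.28 sec of new signal (a 2.56 sec window with 50% overlap); the source does not say how walk time was actually computed, and the rows are toy data.

```python
from collections import Counter
from statistics import mean, stdev

WINDOW_SEC = 2.56                                # window length, from the text
STEP_SEC = WINDOW_SEC / 2                        # new signal per window at 50% overlap

# Toy (subject, activity) rows, one per window; hypothetical data
rows = [
    (1, "WALKING"), (1, "WALKING"), (1, "SITTING"), (1, "WALKING"),
    (2, "WALKING"), (2, "WALKING"), (2, "LAYING"),
    (3, "WALKING"), (3, "WALKING"), (3, "WALKING"), (3, "WALKING"),
]

walk_windows = Counter(subject for subject, activity in rows if activity == "WALKING")
walk_time = {subject: n * STEP_SEC for subject, n in walk_windows.items()}

# Standardised walk times; large absolute values hint at possible outliers
times = list(walk_time.values())
mu, sigma = mean(times), stdev(times)
z_scores = {subject: (t - mu) / sigma for subject, t in walk_time.items()}

print(walk_time)
print(z_scores)
```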
Participants Walk Analysis 2: Walk style analysis for each participant
In this study we try to capture the walk pattern or walk signature of each participant. This is an important feature to study, as walk patterns can be used to distinguish individuals and, with more data points, even identify them uniquely, or to capture changes in physical behaviour such as signs of fatigue vs healthy walking. In order to plot the multivariate data in two dimensions, we reduce the data dimensions to just two using PCA and t-SNE (t-Distributed Stochastic Neighbour Embedding) as shown in Fig. 17. We observe from the plots that each participant has a different walk pattern. We also note that for most of the patterns we can find two clusters. We can infer that each cluster may denote a ‘Walk experiment’.
Fig. 17 Walk patterns for individual participants
7. Conclusions and Future Scope
In this case study we showcased how machine learning can be used to analyse data from sensors that are already present in most smartphones and to gain rich insights about the participants studied. We can identify activities, classify or group participants by activity, and gain additional insights into the activity durations and patterns of the individuals involved in those activities. One can exploit the rich scope such insights have to offer in developing real-time human asset monitoring in highly secured installations, tracking the elderly or people with movement disabilities or illness for emergencies based on movement patterns, determining whether a person is under fatigue, and so on. The application domains range from healthcare to security services and fitness monitoring.
Dr. Anish Roy Chowdhury is currently an Industry Data Science professional mentoring Karmaa Lab. In previous roles he was with ABInBev as a Data Science Research lead, working in areas such as Assortment Optimization and Reinforcement Learning, to name a few. He also led several machine learning projects in areas of Credit Risk, Logistics and Sales Forecasting. In his stint with HP Supply Chain Analytics he developed data quality solutions for logistics projects and worked on building statistical models to predict spare part demand for large format printers. Prior to HP he had 6 years of work experience in the IT sector as a database programmer. During his stint in IT he worked on Credit Card Fraud Detection among other analytics-related projects. He has a PhD in Mechanical Engineering (IISc Bangalore). He also holds an MS degree in Mechanical Engineering from Louisiana State Univ., USA. He did his undergraduate studies at NIT Durgapur, with published research in GA-Fuzzy Logic applications to medical diagnostics.
Dr. Anish is also a highly acclaimed public speaker with numerous best presentation awards from national and international conferences, and has conducted several workshops in academic institutes on R programming and MATLAB. He also has several academic publications to his credit and is a chapter co-author for a Springer publication. He has extensively contributed to the revision of a bestselling MATLAB book from Oxford University Press, being the sole contributor to chapters on data analysis and statistics.