School of Science and Technology 科技學院
Computing Programmes 電腦學系

Using Event Camera to Detect, Track, and Classify the Human Body

Chandwaney Jatin Vimal, Lui Tung Lam, Li Zhenyou, Li Zeliang

Programme Bachelor of Science with Honours in Computer Science
Supervisor Dr. Au Yeung Siu Kei Jeff
Areas E-Health and Medical Applications
Year of Completion 2022

Objectives

Aim

Our project aims to use an event camera as an all-day stationary surveillance camera, collecting continuous event data to train deep learning models that detect accidents and medical emergencies for single-living elderly people in low-lighting environments.

Objectives

The objectives of this project are as follows:

  • Research methods on how to train a model using data outputted by the event camera.
  • Collect event data from the event camera.
  • Use the raw data outputted by the event camera to train a model to detect, track, and classify actions done by humans.
  • Convert the raw data outputted by the event camera to frame-based data, and use the resulting data to train a model to detect, track, and classify actions done by humans.
  • Compare the performance of the two methods to discern which method is better, and report our findings to ASTRI.
  • Build our own datasets to facilitate model training.

Videos

Demonstration Video

Methodologies and Technologies used

Since there are two approaches to finding a solution, we split our group into two smaller groups of two: one group focuses on converting the event data into frame data and training a model on it, while the other group uses the raw event data directly to train a model. Although the project follows two different approaches, the methodologies are roughly the same, as shown in Figure 1: (1) data processing, (2) model training, and (3) approach evaluation. Splitting the work lets us use our time more efficiently and work in parallel, and each group reports its progress regularly so we know how to continue with the project. Since this is an industry-based project, we also communicate our progress with ASTRI. As we progressed through the project, ASTRI provided us with more information and advice on how to proceed.

Figure 1. Methodology Design

Frame-Based Method

Supporting Technologies 

ASTRI lent us an event-based camera for data collection. Event-based cameras generate event data, which we need to convert into frame data before feeding it into frame-based object detection technologies.

Nowadays, there are many deep learning algorithms for object detection and action detection, such as You Only Look Once (YOLO), the Single Shot Detector (SSD), 3D Convolutional Neural Networks (3D CNNs), and the Temporal Shift Module (TSM). In image object detection, YOLO and SSD are two of the most widely used algorithms because of their good speed and accuracy. TSM is more powerful for action detection since it learns temporal information during training; it can achieve performance comparable to a 3D CNN while keeping the low complexity of a 2D CNN.

In addition, to support building our action detection dataset, we use video editing tools to cut the event videos into small clips according to the different action classes. We collect several long videos that include actions from different orientations and then edit them.

Technical gap

In order to train a model that can detect actions of the human body, a proper dataset containing the different action classes has to be built first. Because we take advantage of conventional image object detectors that do not accept event data as input, frame data converted from raw event data is used as training data. For image object detection with YOLO and SSD, we transform event data into images to build the dataset and annotate the images to mark the region of the human body, which is the most important information for training the models.

For action detection with TSM, we first convert our frame data to video format and then, according to the different action classes, edit the videos into small clips that each contain only one specific action. Temporal information is essential in action detection, so static images should not be used to build this dataset.

Before stepping into model training, we need to build our dataset, which can be divided into three parts: (1) data collection, (2) data conversion, and (3) data annotation. ASTRI provided us with suggestions on which human actions to detect, namely the elderly falling over and medical emergencies, especially at night. The method we use to convert raw event data to frame data constructs a three-channel image from the events in a predefined time period: all events within this period are accumulated into the same image, which then resembles an RGB image. We derived this conversion function with the help of ASTRI.
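To make the conversion more concrete, the following is a minimal sketch of this kind of time-window accumulation. It assumes the events in one window arrive as NumPy arrays of integer pixel coordinates and polarities; the exact channel mapping we derived with ASTRI is not reproduced here, so assigning positive events, negative events, and total counts to the three channels is only an illustrative choice.

```python
import numpy as np

def events_to_frame(x, y, polarity, width=640, height=480):
    """Accumulate one predefined time window of events into a 3-channel image.

    x, y, polarity are NumPy arrays holding the events that fall inside the
    time window (x, y as integer pixel coordinates). The channel assignment
    below (ON, OFF, all events) is illustrative, not the project's exact mapping.
    """
    frame = np.zeros((height, width, 3), dtype=np.float32)
    on = polarity > 0
    np.add.at(frame[:, :, 0], (y[on], x[on]), 1.0)    # ON (positive) events
    np.add.at(frame[:, :, 1], (y[~on], x[~on]), 1.0)  # OFF (negative) events
    np.add.at(frame[:, :, 2], (y, x), 1.0)            # all events
    # Scale to 0-255 so the result can be written out as an ordinary image.
    frame = np.clip(frame / max(frame.max(), 1.0) * 255.0, 0, 255)
    return frame.astype(np.uint8)
```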

We start the frame-based method with static image detection. Using the human body dataset of images provided by ASTRI, we train two image detection models, YOLOv5 and SSD, and evaluate their performance to obtain the best model for detecting the human body in event images. Ideally, static image detection on event images should work well, since we have converted the event data into a common image format. Once the first stage is done, we move to the second stage, action detection. We first use the best model obtained from static image detection to attempt action detection and see whether it can capture the temporal information of continuous actions. Unsurprisingly, it fails to detect actions in videos. The most important reason for this failure is that static image detection algorithms have no mechanism for learning temporal information.

ASTRI gave us suggestions on action detection; the Temporal Shift Module (TSM) is an ideal algorithm for this task. It applies shifting along the temporal dimension during training, which makes it more powerful for understanding continuous actions while still achieving high performance. For TSM training, we collected videos of four actions: asking for help, falling down, heart attack, and headache. We then edited them into small clips, each containing one specific action, to fit TSM training. In the end, we evaluate the TSM model to see whether it can indeed detect actions.

The model we train can be applied to specific scenarios in future development: detecting accidents from the four action classes that may happen to the elderly, especially at night. We can exploit the advantages of event-based cameras over traditional cameras to better protect the safety of the elderly.

System Design

The main principle of our system is to take advantage of the frame-based object detection technologies that are readily available at the current stage to perform action detection on event-based camera data. Figure 2 shows the basic design idea of the frame-based approach.

Figure 2. Frame-data model training process

Event-based Method

Supporting Technologies

To be able to use the camera's raw data, we need a specialized neural network that can deal with the sparse and continuous data provided by the event camera. To support the raw data, we train a spiking neural network (SNN) using a library called SpikingJelly (Multimedia Learning Group, Institute of Digital Media (NELVT), Peking University, n.d.). As mentioned in Section 2, this library simulates an SNN by building on an analog neural network and converting it through a number of steps. The implementation of the SNN is done using PyTorch. To use the raw data in the SNN, we have to code a custom dataset class for our dataset, which allows PyTorch to load the data and send it to the neural network.
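A minimal sketch of such a dataset class is shown below, assuming the preprocessed samples are stored as .npz files under events_np/<split>/<label>/ as described in the Implementation section; the field names t, x, y, p inside each file are our own assumption.

```python
import os
from glob import glob

import numpy as np
import torch
from torch.utils.data import Dataset

class EventSampleDataset(Dataset):
    """Loads one .npz event sample per item from events_np/<split>/<label>/."""

    def __init__(self, root, split="train"):
        # One folder per label, each holding the .npz samples for that class.
        self.files = sorted(glob(os.path.join(root, split, "*", "*.npz")))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        path = self.files[idx]
        data = np.load(path)
        # Field names (t, x, y, p) are assumed; use whatever keys the
        # preprocessing step actually stored in the .npz file.
        events = np.stack([data["t"], data["x"], data["y"], data["p"]], axis=1)
        label = int(os.path.basename(os.path.dirname(path)))  # folder name = label id
        return torch.from_numpy(events.astype(np.float32)), label
```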

Technical Gap

SpikingJelly attempts to simulate a spiking neural network, and while it does so with some accuracy, it can never be as accurate as a native implementation of an SNN, which requires specialized hardware. Most of the other neural networks that allow training a model on raw data require extensive machine learning knowledge to implement, and the research papers do not provide programmed implementations of these networks. Given the lack of time and the urgency to train a model, we cannot attempt to implement these methods ourselves, because without a low-level understanding of how these networks function we do not know whether they would work properly.

In order to train the model with raw data, we need to label the data in such a way that the model knows what sort of event is occurring at any given time. The first major part of our labeling system is a flag integer value, which indicates the event that is happening during this label. The flag initially has four values: 1 (ask for help), 2 (fall down), 3 (headache), and 4 (heart attack). Along with the flag value, we have two timestamps: the first is the start time of the event in microseconds, and the second is the end time of the event in microseconds. Using this labeling system allows us to train a model on one long recording rather than many smaller recordings.
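As an illustration, a label file for one long recording could be read as follows. The file name is hypothetical; the three fields (flag, start_time, end_time) are the ones described above and used later by the preprocessing pipeline.

```python
import pandas as pd

# Hypothetical label file for one long recording. Each row marks one action:
# flag: 1 = ask for help, 2 = fall down, 3 = headache, 4 = heart attack
# start_time / end_time: boundaries of the action in microseconds
labels = pd.read_csv("recording_01_labels.csv")

for row in labels.itertuples():
    print(f"action {row.flag}: {row.start_time} us -> {row.end_time} us")
```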

Since we do not have a neuromorphic processor that would enable us to deal with the raw data directly, we use a simulated method based on the tempotron supervised learning algorithm. The tempotron uses the leaky integrate-and-fire (LIF) neuron model. The LIF neuron uses two basic components: (1) a linear differential equation describing the evolution of the membrane potential, and (2) a threshold for spike firing (Gerstner, W., n.d.). The LIF neuron model assumes that the spatial and temporal integration of inputs is linear (Florian, R. V., 2008). To train on the raw data, each individual spike is sent as input to the tempotron; each input initiates a postsynaptic potential (PSP) kernel, and the neuron's total potential is calculated by summing the PSP kernel contributions of all individual spike inputs. If the resulting sum exceeds a specific threshold, the neuron fires an output spike, which, for our purposes, is the predicted label for the input sample.
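This calculation can be sketched as follows, using the standard double-exponential PSP kernel of the tempotron. The time constants and the normalisation factor shown here are illustrative values, not the ones used in our training.

```python
import numpy as np

def psp_kernel(dt, tau=15.0, tau_s=3.75, v0=2.12):
    """Postsynaptic potential kernel K(t - t_i); zero before the spike arrives.

    tau / tau_s are the membrane and synaptic time constants, and v0 normalises
    the kernel peak to roughly 1 (illustrative values).
    """
    dt = np.asarray(dt, dtype=np.float64)
    dt_pos = np.maximum(dt, 0.0)
    k = v0 * (np.exp(-dt_pos / tau) - np.exp(-dt_pos / tau_s))
    return np.where(dt > 0, k, 0.0)

def membrane_potential(t, spike_times, weights, threshold=1.0):
    """Sum the weighted PSP contributions of every input spike at time t.

    spike_times: one array of spike times per afferent.
    weights:     one synaptic weight per afferent.
    Returns the potential and whether the neuron fires an output spike.
    """
    v = sum(w * psp_kernel(t - ts).sum() for w, ts in zip(weights, spike_times))
    return v, v >= threshold
```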

System Design

As stated above, we will be using SpikingJelly to train a model on the raw data. Figure 3 shows a high-level flowchart of the steps required to train a model using this method.

Figure 3. High-level view of training with raw data

Implementation

Self-collected Dataset

A self-collected dataset, referred to as the Single-living Elderly Accident Dataset (SEAD), has been created to support our aim: using an event camera as an all-day surveillance camera to detect accidents for the single-living elderly. The event camera is suitable for this because of properties such as (1) energy and storage efficiency, (2) suitability for challenging lighting conditions, and (3) compact size. These properties make it a much better candidate than a conventional RGB camera, since all-day monitoring generates a lot of data, and the low lighting conditions in the evening and at night are too challenging for conventional cameras.

We want to perform action detection on elderly-specific accidents and medical emergencies. As no existing dataset fits our needs, we built the dataset ourselves using the event camera lent to us by ASTRI.

Our dataset SEAD contains four action classes: “fall down”, “seek help”, “heart attack”, and “headache”. These four actions were selected because (1) they are distinctly different from each other (a counterexample would be “heart attack” and “stomachache”, as the heart and stomach are physically too close to each other), and (2) they can be easily mimicked by students. Please refer to the presentation video, which shows part of our SEAD dataset.

As for the data acquisition, the event camera system used has a VGA resolution of 640×480 and consists of a Gen3.1 VGA sensor (PPS3MVCD) on the Evaluation Kit 1 (EVK1 – VGA) from PROPHESEE. It is connected to a laptop running 64-bit Ubuntu 20.04 Linux, with Metavision Essentials and other necessities installed. As the environments used for data collection do not have non-flickering lighting (e.g., halogen lighting) installed, many background noise events are captured by the event camera. We used two ways to perform preliminary denoising during data collection. (1) Adjust the lens to a smaller focal length, which directly reduces the amount of light reaching the sensor. A focal length of 8 mm or 16 mm gives acceptable crispness in our testing; anything lower than 8 mm becomes too blurry. (2) Tune up the camera parameter “bias_fo” to suppress noisy background events; this sacrifices some performance/speed for a less noisy input. We found a “bias_fo” value of 1637 to be a sweet spot in the performance/data-quality tradeoff: it significantly alleviates the data quality issue while giving acceptable performance for our surveillance application. These steps serve as preliminary denoising at the hardware level during data collection.

We collected our data based on the following assumptions deduced from our aim:

  • Single-living elderly means that, most of the time, only one person will be in the camera’s field of view, so we only consider cases where a single person is present.
  • The environments include the living room, bedroom, etc.
  • The surveillance camera will be used all day, so the lighting contrast throughout the day is enormous (completely dark or bad lighting conditions at night).
  • The surveillance camera will be placed at a certain height; two heights, 70 cm and 3 m, were tried. Normally, the camera would be placed as high as possible to have the best view; the 70 cm placement mimics the camera being placed on a table where ceiling placement is not available, and at this height subjects are more likely to be blocked from the camera’s field of view by other obstacles.
  • Accidents could happen at any moment or place in an environment, and the person may not face the camera directly. Actions need to be captured from as many angles as possible, so each subject performs the same action rotationally, facing at least eight different directions.

 

Two subjects participated in the data collection, and 2 hours’ worth of data in RAW format was collected, which was then converted to other data formats to meet the needs of the event-based and frame-based approaches (e.g., the 2 hours of data correspond to 216k frames at 30 FPS). Of these 2 hours, 30 minutes were recorded with the camera placed at 70 cm and 90 minutes with the camera placed at 3 m. Through manual annotation, there is a total of 505 labeled data points: “fall down” has 118 counts, “seek help” 312, “heart attack” 46, and “headache” 29. Among the four action classes, “seek help” occupies a significant portion of our dataset because of its many interpretations. Although people generally understand the action of “seeking help” or “waving a hand”, they perform it differently: they may wave either hand or both hands, and the waving frequency, duration, direction, and range of motion vary from person to person. We want to capture as much of this variety as possible by recording versions such as palm-only, forearm, or shoulder waving, each of which can be done with the right hand, the left hand, or both hands, with different durations and ranges of motion. People generally have more consistent interpretations of the other three action classes, “fall down”, “heart attack”, and “headache” (i.e., these actions show less variety compared to “seek help”).

Frame-based Method

We start with static image detection to see whether human body detection is successful on event images. We then move on to action detection in videos collected by event cameras, which has more applications for future development. After discussing with ASTRI and our supervisor Jeff, we used YOLOv5 and SSD as the two deep learning algorithms for static image detection and TSM for action detection.

For static image detection, ASTRI provided us with a batch of event images to train a model, already converted from raw event data collected by the event camera. Most of the images show the human body in an environment, while some show only the environment or part of a human body.

Figure 4. Event Image

We first implement human body detection only, so there is only one class, “body”, to classify. Since the data ASTRI provided is large (around 39k images), we annotated it as follows: we first annotated six hundred to one thousand images manually, then fed these annotated images into SSD training to obtain a model that helped us automate the remaining annotation. This annotation method saved us time and labor.
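A rough sketch of this semi-automatic annotation step is given below. The project used an SSD model trained on the manually annotated subset; here a pretrained YOLOv5 model loaded through torch.hub merely stands in for that detector, and the plain-text annotation format is an assumption. The automatically generated boxes would still be reviewed and corrected by hand where needed.

```python
import torch

# Stand-in detector for illustration; the project itself used an SSD model
# trained on the ~600-1,000 manually annotated event images.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

def auto_annotate(image_paths, conf_threshold=0.5):
    """Write one annotation line per detected body box (format is assumed)."""
    for path in image_paths:
        results = model(path)
        boxes = results.xyxy[0]  # columns: x1, y1, x2, y2, confidence, class
        with open(path.rsplit(".", 1)[0] + ".txt", "w") as f:
            for x1, y1, x2, y2, conf, _ in boxes.tolist():
                if conf >= conf_threshold:
                    f.write(f"body {x1:.0f} {y1:.0f} {x2:.0f} {y2:.0f}\n")
```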

For action detection, we try to detect four classes of action from the event camera: ask for help, fall down, heart attack, and headache, and we collect videos of these four actions from different orientations.

Figure 5. Event Videos

Event-based Method

As stated in previous sections, for the event-based method we will be using SpikingJelly to train on the raw data: not only is SpikingJelly the only available implementation for non-neuromorphic processors that we have been able to find, but other implementations of SNNs also use SpikingJelly as a low-level framework.

The first step in our implementation is to process the raw data into the separate samples listed in our label files. We first create the events_np directory, which contains a train and a test directory, both populated by one folder per label; these folders contain the npz files holding the spike data for each sample. These npz files are accessed by the model and used for training and evaluation. Since a raw file can contain gigabytes’ worth of spike data, amounting to tens of millions of lines per file, we need a way to load the entire file into memory without taking too much time. The default Python method of reading these files would have taken a couple of minutes just to load a file, so we chose the read_csv method from pandas, which uses NumPy as its low-level framework, which in turn uses C to provide fast and efficient array methods. After loading the CSV file, we save each column as a NumPy array in a dictionary and return that dictionary.
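A minimal version of this loading step, assuming the raw file is comma-separated with a timestamp, x, y, and polarity column (the column names are our assumption):

```python
import pandas as pd

def load_raw_events(csv_path):
    """Read a large raw event file with pandas and return one NumPy array
    per column in a dictionary (column names are assumed)."""
    df = pd.read_csv(csv_path, names=["t", "x", "y", "p"])
    return {col: df[col].to_numpy() for col in df.columns}
```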

Once we have a dictionary of the entire raw file, we have to extract each sample and save it to its respective label folder in the train or test set. This is done by finding all events that lie in the range specified by the start_time and end_time columns of our labeling file and saving those events into an npz file in our events_np folder.
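A sketch of this slicing step, assuming the dictionary of column arrays returned above and one labeled row with its flag, start_time, and end_time:

```python
import os
import numpy as np

def save_sample(events, flag, start_time, end_time, out_dir):
    """Cut the events between start_time and end_time (microseconds) and
    store them as one .npz sample under that label's folder."""
    mask = (events["t"] >= start_time) & (events["t"] <= end_time)
    label_dir = os.path.join(out_dir, str(flag))
    os.makedirs(label_dir, exist_ok=True)
    out_path = os.path.join(label_dir, f"{start_time}_{end_time}.npz")
    np.savez(out_path, **{key: values[mask] for key, values in events.items()})
```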

We differentiate between samples that belong in the train set and samples that belong in the test set by placing the name of the raw file in either trials_to_train.txt or trials_to_test.txt; these are the files our preprocessing pipeline uses to decide how to process each raw file. After this, we can start to train the model.

We train the model for around 100 epochs at most, as each epoch can take up to 10 hours depending on the number of training samples. We set the learning rate to 0.001 for our optimizer and encoder, using stochastic gradient descent optimization and Gaussian tuning encoding.

For each epoch, we first train the model and then evaluate it, so we can see in real time how the model improves at predicting the raw spikes. In each epoch, we loop through each spike in each sample and send it to the model for training, incrementing a counter when a spike is predicted correctly. To decide whether the PSP threshold criterion is met for a sample, we check whether more than 66% of its spikes were predicted correctly; if so, the sample counts as predicted correctly. The training loss is calculated from the number of incorrectly predicted spikes, and the training accuracy from the number of correctly predicted samples. We repeat the same process for testing, but with the model in evaluation mode. After every epoch, we save the model to disk so we have an updated version, since each epoch takes a long time to run.
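The structure of one epoch can be sketched as follows. The interface of the tempotron-based model, the loss criterion, and the data loader are assumptions made for illustration; only the 66% per-sample voting rule and the per-epoch checkpointing come from the description above.

```python
import torch

def run_epoch(model, loader, optimizer, criterion, train=True, vote_ratio=0.66):
    """One pass over the data; a sample counts as correct when more than
    66% of its spikes are predicted with the right label."""
    if train:
        model.train()
    else:
        model.eval()
    correct_samples, total_loss = 0, 0.0
    for spikes, label in loader:              # one sample per batch
        spikes = spikes.squeeze(0)
        correct_spikes = 0
        for spike in spikes:                  # feed spikes one at a time
            output = model(spike)             # class scores for this spike (assumed interface)
            loss = criterion(output.unsqueeze(0), label)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item()
            if output.argmax().item() == label.item():
                correct_spikes += 1
        if correct_spikes / max(len(spikes), 1) > vote_ratio:
            correct_samples += 1
    return correct_samples / len(loader.dataset), total_loss

# After each epoch the model is checkpointed, e.g.:
# torch.save(model.state_dict(), "checkpoints/epoch_latest.pt")
```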

Evaluation

Frame-based Method

We first performed some testing on the trained YOLOv5 and SSD models. As we expected, they can detect the human body very well in static event images. Comparing YOLOv5 with SSD, the former obtained better performance according to mAP: YOLOv5 reached around 0.99 mAP, while SSD only reached around 0.75.

Figure 6. Static image detection comparison – YOLOv5 and SSD

From the visualized output we can see that the accuracy and precision of YOLOv5 outperform SSD. Some human bodies overlap in the event images and are easily detected by YOLOv5 but poorly by SSD, which indicates that YOLOv5 is more powerful at understanding human body features in event images. Going forward, we tried passing event videos that show a person performing an action to YOLOv5 to see whether it can detect the action as well. It turns out that YOLOv5 fails to do so. The main reason is that static image detection algorithms lack a mechanism for learning the temporal information in videos; linking the current frame with previous and future frames during training is essential for understanding videos. That is why we need TSM to perform our action detection.

So we proceeded to evaluate the TSM model. We fed the event videos in the test set to it for testing; Figure 7 shows the testing result.

Figure 7. Test result of TSM

We used 8 videos of seeking help, 18 videos of falling down, 1 video of headache, and 2 videos of heart attack to test our TSM model. The detection of seeking help and falling down reached relatively high accuracy, while the other two classes were not detected. Even though TSM can detect some actions in our data, such high accuracy may indicate problems in our dataset or other components. We also used another test set, but the results remained almost the same: high accuracy for seeking help and falling down while the other two could not be detected.

From these evaluations, we conclude that: (1) Our dataset for TSM training is relatively small, especially for the headache and heart attack classes, which have only around 10 videos each; insufficient data for TSM to learn the actions well is the main reason headache and heart attack cannot be detected. (2) Due to COVID-19, the event camera was kept by one teammate throughout the data collection process, so all the actions used for training and testing were collected from a single teammate, which decreases the diversity of our dataset. The test set is therefore similar to the training set, which makes the high testing accuracy for seeking help and falling down less reliable.

Event-based Method

After every epoch of training, we tested the performance of the model against our training set. Initially, we expected the training accuracy to start low and slowly increase as we trained the model for longer, as is normally expected of any functioning machine learning model. However, while training, our model started at 20% accuracy for both training and testing, and as time went on, the training and test accuracy remained constant, while the training and testing loss tended toward zero the longer the model was trained. Currently, the only accurate prediction we have is for the 'seeking help' label, which is predicted correctly 100% of the time, while the other labels are not predicted correctly.

Figure 8. Training results of SNN

In the preprocessed dataset, there are five labels in total: the four labels we outlined above, as well as an empty label '0' that contained about one hundred samples, which should not have affected training at all. One thing we can do to improve performance is to increase the number of training samples for the model to learn from; to speed up training, we only trained one sample per label, since we did not have much time to get results. Given more time, we would want to use at least five samples per label, and hopefully that would improve model performance.

Another issue that could be causing low model performance is that the model cannot accurately relate our input to our labels, which is caused by the model being too simple. This could be solved by increasing the model's flexibility or by changing the tempotron model itself.

Figure 9. Testing results of SNN

Conclusion

We collected a Single-living Elderly Accident Dataset (SEAD) using an event camera. It contains four action classes: seeking help, falling down, headache, and heart attack. There are two types of data in SEAD: the raw event data output by the event camera, and the video data converted from the event data. We conducted two experiments using these two types of data to perform human action detection.

Through our experiment on video data, we show that frame-based image detection algorithms like YOLO and SSD can detect the human body in frames extracted from event videos with relatively high accuracy. However, they can hardly detect actions in videos, since no temporal information is learned. Compared to them, TSM is more powerful for action detection. Although the TSM model we trained cannot detect all four classes in SEAD due to the limitations of our dataset, it still learned the actions of seeking help and falling down. All of the frame-based experiments, of course, rely on the conversion from raw event data to video data.

In our second experiment, on the raw event data, we show that while it is possible to train on raw data as input, so far it has not been as successful as we had hoped. This could be because we did not use much data while training the model; since we do not have enough time to test this hypothesis, we leave it to future groups to investigate.

We have seen that, so far, TSM is more accurate than our tempotron SNN, as our SNN can only accurately predict the seeking help label, while the TSM model can accurately predict seeking help as well as falling down.

Given more time, we would have liked to conduct more testing on the SNN implementation, such as feeding frames for training instead of raw data. An SNN can be trained with both raw event data and frame data, and it also extracts temporal information, so more experiments can be conducted on it. This may also be a better way to use an SNN, as the majority of implementations that use SpikingJelly convert raw data to frames and train the SNN model on them. Moreover, we would also like to improve our SEAD dataset by adding more data for each class and increasing its diversity. Our experiments demonstrate how deep learning algorithms can be applied to train models with event data in the computer vision area. We believe there are good prospects for applying event cameras in computer vision, such as performing detection under low-light conditions or capturing movement in challenging lighting environments.
