

- Open Access
- Authors: Srinath Bondala, Doddapaneni Meghan Chowdary, K. Ritika Reddy, Mr. B. K. Chinna Maddileti
- Paper ID: IJERTV14IS040023
- Volume & Issue: Volume 14, Issue 04 (April 2025)
- Published (First Online): 12-04-2025
- ISSN (Online): 2278-0181
- Publisher Name: IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Real-Time Human Activity Recognition using 3D CNN: A Combinatorial Deep Learning Approach
Srinath Bondala
Computer Science and Engineering, CMR College of Engineering & Technology, Kandlakoya, Hyderabad, India
Doddapaneni Meghan Chowdary
Computer Science and Engineering, CMR College of Engineering & Technology, Kandlakoya, Hyderabad, India
K. Ritika Reddy
Computer Science and Engineering, CMR College of Engineering & Technology, Kandlakoya, Hyderabad, India
Mr. B. K. Chinna Maddileti
Assistant Professor, Department of CSE, CMR College of Engineering & Technology, Kandlakoya, Hyderabad, India
Abstract: 3D Convolutional Neural Networks (3D CNNs) for Human Activity Recognition (HAR) have emerged as a developing area with significant applications in human-computer interaction, monitoring, and medicine. To ensure accurate human action tracking and recognition, this research presents a system that combines different deep learning approaches for real-time HAR. Unlike traditional methods, our approach performs temporal feature extraction and object classification simultaneously, providing accurate classification across a wide range of activities. The proposed solution addresses the problems of motion diversity, real-time inference, and model generalization by processing video streams from both live and archived camera footage. The methodology covers data preparation, model training, and classification; classification accuracy, latency, and other system metrics are used to measure performance and establish its viability for practical applications.
TensorRT acceleration and model quantization are two optimizations being investigated for deployment on edge devices to improve real-time performance. The findings bridge the gap between deep learning advances and real-world usability, aiding the creation of effective HAR systems. Future work will focus on improving lightweight architectures, expanding datasets, and refining tracking techniques to increase real-time recognition capability.
Keywords: deep learning, 3D CNN, video-based activity recognition, surveillance systems, healthcare monitoring, edge computing, computer vision, motion analysis, model optimisation, object detection, real-time tracking, human-computer interaction, real-time detection, YOLOv8, X3D-M, DeepSORT.
I. INTRODUCTION
Human Activity Recognition (HAR) has become one of the most important tools in several fields, such as surveillance, smart environments, and healthcare. Applications like fitness tracking, workplace safety, human-computer interaction, and senior-care monitoring depend on the capacity to precisely identify and categorise human actions. Deep learning-based solutions are required because traditional activity detection techniques, which depend on manually crafted features and sensor-based methods, frequently have problems with generalisation and real-time performance. Changes in illumination, occlusions, body positions, and cluttered backgrounds make vision-based activity detection especially difficult. Recurrent neural networks (RNNs) or 2D CNNs are used in many traditional methods, but they may not fully represent the temporal dynamics of human behavior. By concurrently extracting spatial and temporal features, 3D CNNs offer a more effective way around these drawbacks and improve recognition accuracy across a range of activity classes.
Our approach's emphasis on real-time processing is one of its main innovations, qualifying it for use in dynamic settings like smart fitness applications, assisted living facilities, and surveillance systems. Because the system is optimized for low latency while keeping excellent accuracy, it can be seamlessly integrated into real-world applications that require continuous tracking and monitoring of human behaviors.
There are still a number of difficulties in recognising human activities, even with recent improvements. Classification accuracy may be affected by problems such as overlapping activities, occlusions in congested areas, and changes in motion intensity. Furthermore, generalisation across many demographic groups and activity patterns remains difficult. To mitigate these restrictions, we have investigated techniques like dataset augmentation, transfer learning, and fine-tuning models on varied datasets.
II. LITERATURE SURVEY
Literature Review on Deep Learning for HAR
Deep learning has played a crucial role in advancing human activity recognition (HAR), enabling accurate classification of actions in diverse environments such as surveillance, healthcare, and human-computer interaction. Researchers have explored various deep learning models, integrating computer vision and sensor-based methodologies to enhance recognition efficiency and real-time applicability. The integration of deep learning in HAR has significantly improved the ability to monitor and analyze human motion in real-world settings.
The emergence of deep learning-based HAR was accelerated by the need for automated activity monitoring. Wang et al. [1] implemented a CNN-LSTM hybrid model for HAR using wearable sensor data, demonstrating superior accuracy compared to traditional machine learning approaches. Their study showcased how combining spatial and temporal features enhances recognition performance. Similarly, Liu et al. [3] achieved great precision in identifying complicated activities by using 3D Convolutional Neural Networks (3D CNNs) to extract spatiotemporal information from video recordings. In order to improve activity classification in congested surroundings, Zhou et al. [2] used an attention-based LSTM network, demonstrating the potential of attention processes to increase recognition accuracy.
For HAR, several deep learning architectures have been proposed. C3D, a 3D CNN model that can learn motion patterns directly from unprocessed video data, was presented by Tran et al. [6]. Because C3D could capture sequential dependencies, it performed better in video-based HAR tasks. To process temporal information more efficiently, Carreira and Zisserman [7] proposed the I3D paradigm, which inflates 2D CNNs into 3D space; this model demonstrated its capacity to identify fine-grained behaviors in large-scale datasets. By utilising optical flow representations, Simonyan and Zisserman [9] presented a two-stream CNN model that improves recognition accuracy by processing motion and spatial information independently. To further enhance HAR accuracy, Wang et al. [8] presented ActionFormer, a Transformer-based model that uses self-attention mechanisms to capture long-range dependencies in video sequences. Activity classification was greatly enhanced by this method, especially for intricate motion patterns. The Vision Transformer (ViT), created by Vaswani et al. [10], substitutes self-attention for convolutions and shows competitive performance in image-based HAR tasks. ViTs have improved performance in a variety of HAR settings by utilising global feature extraction, which lessens the need for handcrafted features.
For real-time HAR, lightweight architectures have also been investigated. MobileNet, a depth-wise separable CNN that preserves accuracy while lowering computational complexity, was first presented by Howard et al. [5]. This architecture is appropriate for mobile apps since it makes HAR deployment on edge devices efficient. EfficientNet, which optimises model scaling for better performance with fewer parameters, was also proposed by Tan and Le [11]. In HAR applications, EfficientNet’s compound scaling strategy has shown promise in striking a balance between accuracy and efficiency.
Advances in HAR have also been facilitated by temporal modeling techniques. Long Short-Term Memory (LSTM) networks, created by Hochreiter and Schmidhuber [12], are excellent at processing sequential input because they maintain long-term dependencies. With sensor-based inputs, LSTMs have been used extensively in HAR. Gated Recurrent Units (GRUs), which Cho et al. [14] developed, provide a more computationally efficient option to LSTMs while retaining robust sequence modeling capabilities. In real-time HAR applications where low latency is essential, GRUs have proven invaluable.
Implementing HAR has been made easier by deep learning frameworks. Abadi et al. [13] presented TensorFlow, which offers scalable deep learning solutions that support a variety of HAR models. PyTorch, created by Paszke et al. [15], has become well-known because of its dynamic computing graph and simplicity of use. By facilitating the quick creation and implementation of deep learning models across multiple domains, these frameworks have expedited HAR research. Researchers have also looked into multimodal ways for HAR. To improve recognition robustness, Nweke et al. [4] used sensor fusion systems that integrate inertial and visual sensor data. Their results showed how well various data sources may be combined to provide a thorough activity analysis. Sports analytics, healthcare monitoring, and smart surroundings have all embraced multimodal learning.
The reviewed literature highlights the significance of deep learning in advancing HAR. Recurrent networks, Transformers, and 3D CNNs have all greatly increased recognition accuracy: models like C3D, I3D, and SlowFast have improved activity classification, while transformer-based techniques have introduced new paradigms for motion identification. Lightweight architectures enable real-time and mobile applications, and multimodal fusion improves resilience. The ongoing development of deep learning models, optimisation strategies, and scalable frameworks will fuel further advances, increasing the accuracy, efficiency, and adaptability of HAR systems in real-world settings.
Notwithstanding notable advancements, real-time HAR still faces difficulties with model interpretability, occlusion resilience, and generalisation to unseen actions. To increase decision-making transparency and guarantee dependability in critical applications like healthcare and surveillance, future research should concentrate on explainable AI techniques. Furthermore, HAR models can employ reinforcement learning to improve adaptability, enabling systems to learn from user interactions and improve performance over time. Advances in hardware acceleration, such as Tensor Processing Units (TPUs) and GPU optimizations, will further enhance HAR efficiency and broaden its range of applications.
III. METHODOLOGY
- Research Design:
The foundation of this study is a real-time human activity recognition framework that combines YOLOv8 for person detection, DeepSORT for tracking, and X3D-M (a 3D CNN) for action categorisation. The model can handle multiple people in a video feed, tracking their motions continuously and accurately identifying the activities of each individual.
A number of optimisation strategies, such as multi-threaded processing and TensorRT acceleration, are used to guarantee effective detection, tracking, and classification. The system is designed to function in real time with low latency, which makes it appropriate for uses like sports analytics, healthcare monitoring, and surveillance.
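To illustrate the TensorRT path mentioned above, the sketch below shows how a YOLOv8 model could be exported to a TensorRT engine with the Ultralytics API. This is a minimal sketch assuming the `ultralytics` package and a CUDA-capable GPU are available; the weights file and the FP16 flag are illustrative choices, not values taken from this paper.

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 detector (the nano variant is an illustrative choice).
model = YOLO("yolov8n.pt")

# Export to a TensorRT engine; half=True requests FP16 precision, which
# typically lowers latency on NVIDIA GPUs with little loss in accuracy.
model.export(format="engine", half=True)

# The exported engine can be loaded and used like any other weights file.
trt_model = YOLO("yolov8n.engine")
```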
The research process comprises several main stages:
- Dataset Selection, Preparation, and Processing
Dataset Selection
The 400 action classes in the Kinetics-400 dataset, which span everyday activities, sports, and interpersonal interactions, make it a good choice for human activity recognition. With 300,000 10-second video clips taken from YouTube, it guarantees a range of lighting conditions, camera angles, and motion patterns for reliable model training.
Dataset Preparation and processing
Training and validation sets are created from the dataset, and preprocessing procedures include:
- Frame Extraction: For temporal consistency, videos were sampled at 16 frames per second.
- Resizing & Normalization: Frames resized to 224×224 pixels, pixel values normalized.
- Data Augmentation: Random cropping, flipping, and colour jittering are applied to improve generalization.
- Temporal Segmentation: To record motion, clips are split up into frame sequences.
These procedures guarantee the best possible input quality for X3D-M model training, increasing real-time action recognition accuracy.
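A minimal sketch of this preprocessing is given below, assuming OpenCV and NumPy for decoding; the clip length, frame size, and [0, 1] scaling follow the values stated above, while the flip probability is an illustrative augmentation choice.

```python
import cv2
import numpy as np

def load_clip(path, num_frames=16, size=224):
    """Sample a fixed-length frame sequence from a video and normalize it."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices give a temporally consistent sample.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (size, size))          # resize to 224x224
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame.astype(np.float32) / 255.0)  # scale to [0, 1]
    cap.release()
    clip = np.stack(frames)                              # (T, H, W, C)
    if np.random.rand() < 0.5:                           # random horizontal flip
        clip = clip[:, :, ::-1, :].copy()
    return clip
```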
- Model Training
Our methodology is a hybrid approach that combines YOLOv8 for human detection, DeepSORT for multi-person tracking, and X3D-M for action recognition.
- Human Detection: YOLOv8
YOLOv8 is employed for human detection due to its high inference speed and superior accuracy, making it well suited for real-time applications. The model's improved detection head and backbone increase the accuracy of feature extraction and localisation. With a mean Average Precision (mAP) of 52.7% on the COCO dataset, it surpasses previous iterations such as YOLOv5 (50.3% mAP) while maintaining notably faster inference. Because of its streamlined design and support for TensorRT acceleration, which ensures efficient real-time processing, YOLOv8 also outperforms more traditional object detection models like Faster R-CNN and SSD. Its key characteristics are summarised in the list below, followed by a usage sketch.
- Detection Accuracy: Achieves 52.7% mAP on the COCO dataset, outperforming YOLOv5 (50.3% mAP).
- Inference Speed: Operates at 150+ FPS on an NVIDIA RTX 3090, ensuring real-time detection.
- Model Efficiency: Provides a better speed-accuracy tradeoff than Faster R-CNN and SSD.
- Optimization: Supports TensorRT for lower latency and enhanced computational efficiency.
- Robustness: Maintains high detection performance under varying lighting conditions and occlusions.
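As referenced above, a minimal detection sketch using the Ultralytics YOLOv8 API is shown below; the weights file and confidence threshold are illustrative assumptions.

```python
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # weights file is an illustrative choice

def detect_people(frame, conf=0.5):
    """Run YOLOv8 on one frame and return person bounding boxes."""
    # classes=[0] restricts detection to the COCO 'person' class.
    results = detector(frame, classes=[0], conf=conf, verbose=False)
    boxes = []
    for box in results[0].boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        boxes.append((x1, y1, x2, y2, float(box.conf[0])))
    return boxes
```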
- Multi-Person Tracking: DeepSORT
DeepSORT is used for reliable multi-person tracking; it preserves track continuity across successive frames by combining a CNN-based feature embedding model with Kalman filtering. By including deep appearance features, decreasing identity switches, and enhancing tracking accuracy in dynamic contexts, it outperforms its predecessor, SORT. Because DeepSORT uses both motion and appearance cues, it is more resistant to occlusions and better at re-identification than conventional motion-based tracking techniques such as ByteTrack, particularly in crowded settings. A usage sketch follows the list below.
- Tracking Accuracy: Outperforms SORT (55.8% IDF1) with an IDF1 score of 64.8% on the MOT17 dataset.
- Identity Switch Reduction: Improves track continuity by reducing identity switches by 34% when compared to SORT.
- Appearance-Based Tracking: Uses a CNN-based feature embedding approach, which increases resilience in busy and occluded surroundings.
- Comparison with ByteTrack: DeepSORT incorporates appearance information to improve re-identification performance, in contrast to ByteTrack, which relies solely on motion cues.
- Real-Time Feasibility: Designed to run efficiently on GPU-accelerated hardware while maintaining good tracking accuracy in real-time applications.
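A usage sketch is given below, assuming the third-party `deep-sort-realtime` package (an assumption; the paper does not name a specific implementation). It consumes detections in the format produced by the YOLOv8 sketch above.

```python
from deep_sort_realtime.deepsort_tracker import DeepSort

# max_age = frames a lost track is kept alive before deletion (illustrative value).
tracker = DeepSort(max_age=30)

def track_people(boxes, frame):
    """Associate per-frame detections with persistent track IDs."""
    # deep-sort-realtime expects ([left, top, width, height], confidence, class).
    detections = [([x1, y1, x2 - x1, y2 - y1], conf, "person")
                  for (x1, y1, x2, y2, conf) in boxes]
    tracks = tracker.update_tracks(detections, frame=frame)
    # Only confirmed tracks have stable identities.
    return [(t.track_id, t.to_ltrb()) for t in tracks if t.is_confirmed()]
```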
- Activity Recognition: X3D-M
X3D-M is used for activity recognition, capturing spatiotemporal information over several frames by utilising an expanded 3D CNN architecture. It is ideal for real-time applications because it strikes a balance between high classification accuracy and computational efficiency. By progressively expanding the network's depth, width, and temporal resolution, X3D-M maintains high performance in identifying human activities while optimizing resource utilisation. Its key properties are listed below, followed by a classification sketch.
- Classification Accuracy: Outperforms I3D (71.9%) and SlowFast (74.7%) with a top-1 accuracy of 75.1% on the Kinetics-400 dataset.
- Computational Efficiency: Significantly lowers computational overhead by using 6.1× fewer FLOPs than I3D.
- Real-Time Feasibility: X3D-M offers a better balance between accuracy and efficiency than TSM and SlowFast, guaranteeing quicker inference without compromising recognition performance.
- Spatiotemporal Modelling: 3D convolution layers capture both spatial and temporal dynamics, enhancing motion-based activity classification.
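A minimal classification sketch is shown below, loading the Kinetics-400-pretrained X3D-M from the PyTorchVideo model zoo via `torch.hub` (an assumed source; the paper does not specify one).

```python
import torch

# X3D-M pretrained on Kinetics-400, from the PyTorchVideo model zoo.
model = torch.hub.load("facebookresearch/pytorchvideo", "x3d_m", pretrained=True)
model.eval()

def classify_clip(clip):
    """clip: float tensor of shape (C, T, H, W) with T=16 frames at 224x224.
    In practice the clip should also be normalized with the Kinetics
    mean/std used during pretraining (omitted here for brevity)."""
    with torch.no_grad():
        logits = model(clip.unsqueeze(0))    # add batch dimension
        probs = torch.softmax(logits, dim=1)
    top_prob, top_class = probs.max(dim=1)
    return int(top_class), float(top_prob)   # index into the 400 Kinetics labels
```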
The combined approach ensures high-speed, accurate, and robust human detection, tracking, and activity recognition in real-time.
- Proposed Design & Evaluation Metrics
- Proposed Design
The suggested approach combines YOLOv8 for human detection, DeepSORT for multi-person tracking, and X3D-M for activity recognition in a multi-stage framework for real-time human activity recognition. YOLOv8 was selected for its strong detection accuracy, achieving a mean Average Precision (mAP) of 52.7% on the COCO dataset, and its high-speed inference, operating at over 150 FPS on an NVIDIA RTX 3090.
Combined with TensorRT optimization, this performance, which beats YOLOv5 (50.3% mAP), enables YOLOv8 to outperform more conventional models like Faster R-CNN and SSD in speed while retaining competitive accuracy.
A summary of the detection performance is provided in Table 1:
Table1: Comparison of YOLOv8 with other detection models in terms of accuracy and speed.
DeepSORT is used for tracking, combining CNN-based feature embeddings with Kalman filtering to ensure reliable multi-person tracking. This method achieves an IDF1 score of 64.8% on the MOT17 dataset, outperforming SORT (55.8% IDF1) and reducing identity switches by 34%, a considerable improvement. DeepSORT's use of appearance-based tracking enhances resistance to occlusions and improves re-identification compared with techniques that use motion cues alone. A summary of the tracking performance can be seen in Table 2:
Table 2: Performance comparison of DeepSORT with other tracking models.
The X3D-M model, an enlarged 3D CNN that effectively captures spatiotemporal characteristics, is used for activity recognition. On the Kinetics-400 dataset, X3D-M surpasses I3D (71.9%) and SlowFast (74.7%) with a top-1 accuracy of 75.1%, requiring 6.1× fewer FLOPs than I3D. Without sacrificing classification performance, X3D-M’s effective spatiotemporal processing makes it ideal for real-time applications. Table 3 provides specifics of the comparison:
Table 3: Accuracy and efficiency comparison of 3D CNN models for activity recognition.
Figure 1: The Flow chart for the execution of the proposed solution.
The detection, tracking, and classification modules of the suggested real-time human activity recognition system are integrated into a multi-stage pipeline. The system first captures an input video stream, from which individual frames are extracted. The YOLOv8 module then processes these frames to detect human beings and produce bounding boxes. The detection output is passed to the DeepSORT algorithm, which preserves tracking continuity across frames and assigns each individual a unique identity.
Each tracked person's 16-frame buffered sequence is then sent to the X3D-M model, which uses the extracted spatiotemporal features to identify activities. An annotated video stream is then created for real-time analysis by superimposing the predicted action labels on the video frames. The flow diagram offers a clear overview of the complete processing pipeline, and a simplified sketch of the loop appears below.
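The sketch below illustrates this pipeline loop. It reuses the hypothetical `detect_people`, `track_people`, and `classify_clip` helpers from the earlier sketches and shows the data flow only; it is not the authors' exact implementation.

```python
import collections

import cv2
import numpy as np
import torch

# Per-track frame buffers; each holds the most recent 16 cropped frames.
buffers = collections.defaultdict(lambda: collections.deque(maxlen=16))

cap = cv2.VideoCapture(0)  # live camera stream; a video file path also works
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    boxes = detect_people(frame)                               # YOLOv8 stage
    for track_id, (l, t, r, b) in track_people(boxes, frame):  # DeepSORT stage
        crop = frame[max(int(t), 0):int(b), max(int(l), 0):int(r)]
        if crop.size == 0:
            continue
        buffers[track_id].append(cv2.resize(crop, (224, 224)))
        if len(buffers[track_id]) == 16:                       # full temporal window
            clip = torch.from_numpy(np.stack(buffers[track_id]))
            clip = clip.permute(3, 0, 1, 2).float() / 255.0    # (C, T, H, W)
            label, prob = classify_clip(clip)                  # X3D-M stage
            cv2.putText(frame, f"ID {track_id}: class {label} ({prob:.2f})",
                        (int(l), int(t) - 5), cv2.FONT_HERSHEY_SIMPLEX,
                        0.6, (0, 255, 0), 2)
    cv2.imshow("HAR", frame)                                   # annotated output
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```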
- Evaluation Metrics
Our real-time human activity recognition system is evaluated using a wide range of metrics covering object detection, multi-person tracking, activity recognition, and overall system efficiency.
- Detection Performance Metrics
- Mean Average Precision (mAP):
YOLOv8 shows strong detection accuracy when comparing predicted bounding boxes with ground-truth labels across a range of IoU thresholds, with a mAP of 52.7% on the COCO dataset.
- Frames Per Second (FPS) and Inference Latency: The detection module runs at more than 150 FPS on an NVIDIA RTX 3090 with an average inference latency of roughly 6.5 ms per frame, ensuring the system satisfies real-time requirements.
- Tracking Performance Metrics
- IDF1 Score:
The IDF1 score evaluates the consistency of identity tracking by comparing correctly identified detections with the total number of computed and ground-truth detections; its formula is given after this list. For our DeepSORT solution, a higher IDF1 score indicates robust tracking, even in cluttered settings.
- Identity Switches (ID Sw.): This metric measures how frequently a previously tracked person is mistakenly given a new identity by the tracking system. Fewer identity switches reflect a more stable tracking technique, which is required to associate activity sequences with specific people accurately.
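For reference, IDF1 is the harmonic mean of identification precision and recall, computed from identity-level true positives (IDTP), false positives (IDFP), and false negatives (IDFN):

```latex
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
```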
- Activity Recognition Metrics
- Top-1 Accuracy:
Top-1 Accuracy quantifies the proportion of cases in which the ground truth label and the model’s highest confidence prediction match identically. A high Top-1 Accuracy for our X3D-M model indicates efficient spatiotemporal feature extraction and human action classification.
- Top-5 Accuracy:
This measure establishes the percentage of cases where the model's top five predictions include the correct action label. It sheds light on how robust the model is, particularly when handling similar or ambiguous actions.
- Inference Latency:
This measures how long it takes to analyse and categorise an action for each person. Low latency is essential for the system's real-time deployment since it guarantees quick decision-making and prompt response in real-world situations.
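The sketch below shows how these recognition metrics could be computed in practice; the run count is illustrative, and accurate GPU timing requires the explicit synchronization noted in the comments.

```python
import time

import torch

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices              # (N, k) predicted classes
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

def mean_latency_ms(model, clip, runs=50):
    """Average per-clip inference time in milliseconds."""
    with torch.no_grad():
        model(clip)                                   # warm-up pass
        if torch.cuda.is_available():
            torch.cuda.synchronize()                  # flush queued GPU work
        start = time.perf_counter()
        for _ in range(runs):
            model(clip)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / runs * 1000.0

# Usage: topk_accuracy(logits, labels, k=1) gives Top-1; k=5 gives Top-5.
```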
- System Efficiency Metrics
- Computational Cost: This metric counts the number of floating-point operations (FLOPs) needed for inference, quantifying computational complexity. A system with lower FLOPs is more efficient and better suited for real-time applications on resource-constrained devices. A measurement sketch follows this list.
- Memory Usage:
Memory Usage assesses the model’s operational footprint. For the system to be implemented in settings with constrained hardware resources without sacrificing performance, memory usage must be optimised.
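As referenced in the Computational Cost item above, one way to measure FLOPs and the parameter footprint is sketched below, assuming the `fvcore` library (an assumption; the paper does not name a profiling tool) and the PyTorchVideo X3D-M model.

```python
import torch
from fvcore.nn import FlopCountAnalysis

model = torch.hub.load("facebookresearch/pytorchvideo", "x3d_m", pretrained=True)
model.eval()

# Dummy input matching X3D-M's expected clip shape: (batch, C, T, H, W).
clip = torch.randn(1, 3, 16, 224, 224)

flops = FlopCountAnalysis(model, clip)
print(f"GFLOPs per clip: {flops.total() / 1e9:.2f}")

# Parameter memory footprint (weights only, FP32: 4 bytes per parameter).
params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {params / 1e6:.1f} M ({params * 4 / 1e6:.1f} MB in FP32)")
```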
These metrics collectively provide a robust framework for evaluating both the performance and efficiency of the proposed human activity recognition system, ensuring it meets the stringent requirements of high accuracy and real-time operation.
IV. RESULTS AND DISCUSSION
The proposed system was assessed in a number of areas, such as real-time efficiency, activity recognition capability, tracking stability, and overall detection accuracy. The efficiency of combining YOLOv8, DeepSORT, and X3D-M into a single, reliable pipeline is demonstrated by the findings that follow.
Together with the module-specific outcomes covered in the section on proposed design, Table 4 offers a comprehensive summary of the system’s main performance indicators. This synopsis shows how detection, tracking, and classification are balanced while preserving real-time performance.
Table 4: Overall System Performance Metrics
The suggested system’s real-time performance is contrasted with baseline setups that make use of other model combinations in Table 5. The benefits of our integrated approach in terms of accuracy and processing economy are demonstrated by this comparison.
Table 5: Real-Time Performance Comparison
The experimental results confirm that the integration of YOLOv8, DeepSORT, and X3D-M yields a system capable of robust, real-time human activity recognition. YOLOv8's high mAP and rapid inference enable reliable detection even under challenging conditions. DeepSORT further enhances the system by ensuring stable multi-person tracking, reducing identity switches significantly compared to simpler tracking methods. Finally, X3D-M's efficient 3D spatiotemporal processing delivers superior activity recognition accuracy with reduced computational demands.
Table 6: Results with 10-clip center-crop testing on K400-val for the expanded instances. For a single clip input, we report the top-1 and top-5 classification accuracy (%) along with the computational complexity in GFLOPs (floating-point operations, measured as multiply-adds ×10⁹). Because each video uses a fixed number of 10 clips, the inference cost is proportional to 10× this value.
Figure 2: Training Curves for X3D-M
A line plot showing the training loss versus validation loss and accuracy curves over epochs for the X3D-M model. This figure illustrates the convergence behavior and stability during training.
Figure 3: Combined operation of the YOLOv8, DeepSORT, and X3D-M models.
Figure 4: Output of the combined implementation of all models on an NVIDIA T4 GPU with multiple people.
The integrated pipeline also demonstrates remarkable robustness under diverse conditions. Extensive testing shows that the system consistently maintains excellent detection and classification accuracy even in difficult situations involving occlusions and fast movements. The training curves (Figure 2) highlight reliable convergence of the X3D-M model, while the outputs on the NVIDIA T4 GPU (Figure 4) show consistent real-time performance across different users. By skilfully fusing the high mAP of YOLOv8 with the steady tracking of DeepSORT and the effective spatiotemporal analysis of X3D-M, the system reduces identity switches and false positives, guaranteeing dependable action recognition even in dynamic or congested environments.
Additionally, the system's minimal computing overhead and real-time efficiency make it a great option for deployment in resource-constrained contexts, such as interactive settings, healthcare monitoring, and surveillance. With 10-clip center-crop testing, the inference cost grows predictably, as indicated in Table 6, guaranteeing that processing needs stay manageable even as the number of input clips rises. This balance between detection, tracking, and classification efficiency confirms the system's suitability for real-world applications and creates opportunities for future improvements, such as further model compression or quantisation, to achieve even lower latency without compromising accuracy.
Figure 5: Performance of YOLOv8 for object detection.
Collectively, these results demonstrate that our system not only achieves high accuracy in detection, tracking, and classification but also meets the stringent requirements for real-time operation. This makes it particularly well-suited for practical applications such as surveillance, healthcare monitoring, and interactive environments.
V. LIMITATIONS AND FUTURE WORK
- Limitations:
- Dependency on Detection Quality: YOLOv8's detection is crucial to the system's operation. Manual threshold settings may not generalize well, and identification mistakes can propagate under difficult conditions (e.g., occlusions, low lighting).
- Limited Tracking Robustness: Although DeepSORT successfully maintains tracking, it can still experience identity switches in crowded environments, which can disrupt the continuity of activity recognition.
- Fixed Temporal Window: Processing fixed-length (16-frame) sequences constrains the model's capacity to capture longer or more intricate actions.
- Problems with Generalization: Although the system does well on benchmark datasets, it might have trouble generalizing to the variety of settings and activity types found in real-world situations.
- Future work:
- Multi-Modal Data Integration: Include additional data modalities (such as audio and sensor data) to improve recognition robustness.
- Adaptive Sequence Modeling: Investigate recurrent or attention-based architectures (such as LSTMs or transformers) and variable-length sequence processing to capture longer temporal dependencies.
- Advanced Tracking Techniques: Explore hybrid tracking strategies that combine transformer-based methods with the DeepSORT model to minimize identity switches.
- Transfer Learning and Domain Adaptation: Use transfer learning techniques to shorten training times and enhance generalization across various datasets.
VI. CONCLUSION
Our work integrates multiple existing technologies into a single pipeline to create a recognition system that operates without human supervision. X3D-M identifies the activity being carried out, DeepSORT tracks several individuals, and YOLOv8 detects humans. The suggested approach attains a high mAP, strong IDF1 scores, and excellent top-1 accuracy. These outcomes allow the approach to compete with other methods in tracking, detection, and classification without sacrificing real-time processing. According to the results, the framework can correctly identify intricate human behaviors in a variety of scenarios without the need for costly and slow computer-vision models. Additional research will concentrate on lowering identity switches in crowded settings, boosting detection efficiency under difficult circumstances, and merging various data types to enhance generalisation and usability in many real-world situations.
REFERENCES
- [1] I. Lillo, J. C. Niebles, and A. Soto, "Sparse Composition of Body Poses and Atomic Actions for Human Activity Recognition in RGB-D Videos," Image and Vision Computing, vol. 59, pp. 63–75, 2019.
- [2] Z. Wharton, E. Thomas, B. Debnath, and A. Behera, "A Vision-Based Transfer Learning Approach for Recognizing Behavioral Symptoms in People with Dementia," in Proc. 15th IEEE Int. Conf. on Advanced Video and Signal Based Surveillance (AVSS), 2018.
- [3] N. P. Motwani and S. Soumya, "Human Activities Detection using Deep Learning Technique – YOLOv8," ITM Web of Conferences, vol. 56, p. 03003, Aug. 2023, doi: 10.1051/itmconf/20235603003.
- [4] J. Cai, X. Tang, and R. Zhong, "Silhouettes Based Human Action Recognition by Procrustes Analysis and Fisher Vector Encoding," in Proc. Int. Conf. on Image and Video Processing, and Artificial Intelligence, vol. 10836, International Society for Optics and Photonics, p. 1083612, 2018.
- [5] D. Maturana and S. Scherer, "VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition," 2019.
- [6] C. Feichtenhofer, "X3D: Expanding Architectures for Efficient Video Recognition," Facebook AI Research (FAIR), 2020.
- [7] S. Neili Boualia and N. Essoukri Ben Amara, "3D CNN for Human Action Recognition," DOI: 10.1109, 2021.
- [8] L. Wang, Y. Qiao, and X. Tang, "Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors," 2015.