With the rapid advancement of smart devices and IoT technologies, analyzing human activity data in real-time has become increasingly important in areas such as healthcare, security, and behavioral analysis. Traditional batch-processing methods often fail to handle the unique challenges posed by continuous data streams, making real-time anomaly detection a critical research problem.
This study introduces an AI-driven framework for detecting outliers in human activity streams, leveraging big data technologies and machine learning algorithms to process data in real time. The research addresses the fundamental challenges of streaming data, including high velocity, infinite size, and evolving patterns, and proposes an efficient, scalable, and robust methodology for identifying anomalies as they occur.
Methodology
The proposed solution integrates big data architectures with machine learning-based anomaly detection in a structured pipeline. Below is an overview of the methodology in diagram form:

Implementation Overview
With the methodology defined, the next step is to describe how the tools and concepts were put into practice. The diagram below illustrates the practical implementation of each methodology step, showing the real-world tools and technologies used:

These tools were combined to address the challenges of detecting outliers in real-time streaming data. Below is how each tool was selected and how it contributed to the success of the system:
1. Apache Kafka: Data Collection and Extraction
Role: High-throughput data streaming platform
Contribution:
Apache Kafka was adopted as the event streaming platform for the data collection and extraction phase. Kafka’s capability to handle large volumes of real-time data made it an ideal choice for this project. It allows data from smartphone accelerometers (used to collect human activity data) to be ingested into the system efficiently. Kafka supports high throughput, low latency, and scalability, which are critical for handling continuous streams of activity data.
- Data Flow Efficiency: Kafka decouples pipeline components and minimizes message delay without data loss or duplication, so incoming activity data is streamed consistently and reliably.
- Real-Time Processing: Kafka’s ability to process and store messages in real-time was crucial for maintaining timely outlier detection. It supported the data extraction phase, enabling smooth handoff to the next component of the pipeline (Apache Spark).
By adopting Kafka, the system achieved low latency and fault tolerance, enabling real-time anomaly detection in a high-throughput environment.
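To make the ingestion step concrete, the sketch below shows a minimal Kafka producer for simulated accelerometer readings. This is an illustrative sketch only: the broker address, the topic name `activity-stream`, and the message schema are assumptions, not details taken from the paper, and it requires the `kafka-python` client.

```python
import json
import random
import time


def make_reading(ts: float) -> dict:
    """One simulated accelerometer sample (x, y, z in m/s^2)."""
    return {
        "timestamp": ts,
        "x": random.gauss(0.0, 1.0),
        "y": random.gauss(0.0, 1.0),
        "z": random.gauss(9.8, 1.0),  # gravity dominates the z-axis at rest
    }


def serialize(reading: dict) -> bytes:
    """Encode a reading as UTF-8 JSON for the Kafka message value."""
    return json.dumps(reading).encode("utf-8")


def stream_readings(n: int = 100) -> None:
    """Publish n simulated readings; requires kafka-python and a local broker."""
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",   # assumed broker address
        value_serializer=serialize,
    )
    for _ in range(n):
        producer.send("activity-stream", value=make_reading(time.time()))
    producer.flush()  # block until all queued messages are delivered
```

In a deployment, `stream_readings()` would run on (or receive data from) the smartphone side of the pipeline, with Spark consuming the same topic downstream.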
2. Apache Spark: Data Processing and Analysis
Role: Distributed stream processing engine
Contribution:
Apache Spark was utilized for the data processing and analysis phase. Spark’s unified stream processing engine empowered the system to execute complex operations and machine learning models in a distributed manner, enabling efficient real-time analytics.
- Stream Processing & Fault Tolerance: Spark’s Structured Streaming module was used to process the incoming activity data in real time, allowing the system to handle large-scale streaming data efficiently with low latency. The architecture provided high fault tolerance through data replication, ensuring the system remained operational even in the event of failures.
- Machine Learning Integration: Spark’s MLlib library was used to implement the Isolation Forest algorithm for outlier detection. Spark enabled the execution of complex machine learning models on streaming data, performing both training and real-time inference. Its distributed nature allowed the system to handle the computational load of large datasets, enabling real-time classification of data as normal or anomalous.
- Windowing: To process the streaming data effectively, windowing techniques (such as sliding windows) were applied using Spark’s windowing function, which grouped data into fixed-size intervals for aggregation and outlier detection. This let the system detect outliers based on temporal patterns in human activity, improving the model’s accuracy.
- Performance Enhancement: Spark’s windowed aggregations enabled the system to maintain high performance by processing data in chunks, reducing expensive recalculations and improving response time for real-time alerts.
Through the use of Spark, the system achieved high effectiveness in processing data streams, performing complex computations, and delivering timely anomaly detection with a high degree of fault tolerance and scalability.
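The windowing step above can be sketched in two parts: a pure-Python helper that mirrors how Spark assigns each event to overlapping sliding windows, and a hedged PySpark Structured Streaming fragment reading from Kafka and computing windowed aggregates. The topic name, column names, and window parameters (10-second windows sliding every 5 seconds) are illustrative assumptions, not values from the paper.

```python
def sliding_windows(ts: float, size: float, slide: float) -> list:
    """Start times of every sliding window (in seconds) containing ts.

    Mirrors how Spark's window(col, "10 seconds", "5 seconds") assigns
    each event to size/slide overlapping windows.
    """
    start = (ts // slide) * slide  # latest window start at or before ts
    starts = []
    while start > ts - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


def build_windowed_features():
    """Hedged PySpark fragment: Kafka source -> per-window feature columns."""
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("activity-outliers").getOrCreate()
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "activity-stream")   # hypothetical topic
           .load())
    schema = "timestamp DOUBLE, x DOUBLE, y DOUBLE, z DOUBLE"
    events = (raw
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*")
              .withColumn("ts", F.col("timestamp").cast("timestamp")))
    # 10 s windows sliding every 5 s: per-axis mean and standard deviation.
    return (events
            .groupBy(F.window("ts", "10 seconds", "5 seconds"))
            .agg(F.avg("x"), F.avg("y"), F.avg("z"),
                 F.stddev("x"), F.stddev("y"), F.stddev("z")))
```

With a 10-second window sliding every 5 seconds, each sample lands in exactly two overlapping windows, which is what lets the detector pick up temporal patterns without recomputing over the whole stream.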
3. Isolation Forest: Outlier Detection Algorithm
Role: Anomaly detection algorithm
Contribution:
The Isolation Forest algorithm was adopted as the core machine learning model for detecting outliers in the streaming data. This model was specifically chosen for its efficiency in handling high-dimensional, imbalanced datasets and its ability to process data streams in an unsupervised fashion.
- Robustness to Dimensionality: Isolation Forest is particularly effective on high-dimensional data such as the accelerometer data used in this research. It isolates anomalies by recursively partitioning the data, a scalable approach for real-time outlier detection that helped mitigate the curse of dimensionality and sped up detection.
- Fast Processing: Isolation Forest was well suited to real-time processing, detecting anomalies without the need for labeled training data. This made it highly effective on continuous streams of activity data, achieving fast classification of normal versus anomalous samples.
- Evaluation Metrics: The model’s performance was evaluated using accuracy, precision, recall, and F1 score. It achieved 97% accuracy, a precision of 0.94, a recall of 0.88, and an F1 score of 0.91, validating its suitability for real-time anomaly detection in human activity data.
The combination of the Isolation Forest algorithm and Apache Spark’s distributed processing capabilities led to the development of a robust, fast, and scalable outlier detection system that successfully detected anomalies in real-time, with minimal latency and high accuracy.
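To illustrate the algorithm’s behavior in isolation from the Spark pipeline, the sketch below applies scikit-learn’s `IsolationForest` to simulated accelerometer-style features with a few injected spikes. The data, feature layout, and parameters (`n_estimators`, `contamination`) are illustrative assumptions, not the paper’s configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Simulated features: mostly normal activity centered on gravity (z ~ 9.8),
# plus 10 injected spikes far from the normal cluster.
normal = rng.normal(loc=[0.0, 0.0, 9.8], scale=0.5, size=(500, 3))
spikes = rng.normal(loc=[8.0, 8.0, 20.0], scale=0.5, size=(10, 3))
X = np.vstack([normal, spikes])

# contamination is the expected fraction of outliers; fit_predict returns
# +1 for inliers and -1 for anomalies.
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)
n_flagged = int((labels == -1).sum())
```

Because anomalies are easier to isolate, they sit closer to the roots of the random partition trees and receive lower scores; points with the lowest `contamination` fraction of scores are flagged as `-1`.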
4. Real-Time Monitoring Dashboard: System Monitoring and Analysis
Role: Visual tracking of outliers
Contribution:
To ensure effective monitoring and analysis of the system, a real-time monitoring dashboard was developed. This dashboard displayed the occurrences of outliers as they were detected, providing an interface for system administrators and end-users to track and respond to anomalies in human activities.
- Real-Time Alerts: The dashboard was designed to provide instant feedback on the outlier detection process, notifying users immediately when unusual behavior was detected (e.g., unauthorized running in restricted areas).
- Data Access Layer: A messaging system was implemented at the final phase of the pipeline to store results safely for later analysis. This ensured that data could be retrieved without losing information, supporting additional analytical tasks if needed.
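The alerting path can be sketched as a small consumer of the detection results. The topic name `activity-outliers` and the result-message fields (`label`, `timestamp`, `score`) are hypothetical, since the summary does not name the dashboard stack; this only illustrates the shape of a result-to-alert handoff.

```python
import json
from typing import Optional


def format_alert(result: dict) -> Optional[str]:
    """Render a dashboard alert line for an anomalous result, else None.

    Assumes hypothetical result fields: label (+1/-1), timestamp, score.
    """
    if result.get("label") != -1:
        return None
    return (f"[ALERT] anomaly at t={result['timestamp']:.2f} "
            f"(score={result['score']:.3f})")


def watch_alerts() -> None:
    """Consume detection results; requires kafka-python and a broker."""
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "activity-outliers",                       # hypothetical topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for msg in consumer:
        line = format_alert(msg.value)
        if line:
            print(line)
```

Keeping the results on a Kafka topic also gives the data access layer its durability for free: the same messages the dashboard consumes can be replayed later for offline analysis.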
Results & Evaluation
The proposed methodology demonstrated significant effectiveness in solving the challenges of real-time outlier detection in human activity data. Key results include:
- Accuracy: The system achieved 97% accuracy in detecting anomalies in real time.
- Performance Metrics: The model’s recall (0.88), precision (0.94), and F1 score (0.91) indicated that the system was both efficient and effective in detecting outliers.
- Scalability: The use of Apache Kafka and Apache Spark allowed the system to scale effectively, handling large amounts of streaming data with minimal delay and ensuring high throughput and fault tolerance.
- Timeliness: Real-time anomaly detection was achieved, providing timely alerts for potential threats or abnormal human activity.
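As a quick sanity check, the reported F1 score is consistent with the reported precision and recall, since F1 is their harmonic mean:

```python
precision, recall = 0.94, 0.88

# F1 = 2PR / (P + R), the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.91, matching the reported F1 score
```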
The results confirm the feasibility of integrating big data processing frameworks with machine learning models for real-time outlier detection, opening opportunities for applications in security surveillance, healthcare monitoring, and smart environments.
Significance & Contributions
✅ First-of-its-kind methodology integrating Apache Kafka, Apache Spark, and Isolation Forest for real-time human activity anomaly detection.
✅ Innovative use of window-based feature engineering for efficient real-time processing.
✅ Scalable, high-performance pipeline, capable of handling continuous data streams with minimal latency.
✅ Potential applications in multiple domains, including security, healthcare, and behavioral analytics.
📄 The full research can be found here.