Apache Hadoop Ecosystem

DOI : 10.17577/IJERTV14IS030225


Raullien Roos Geingos
Department of Computer Science and Application, Sharda University, Greater Noida, Uttar Pradesh, India

Ravi Prakash Chaturvedi
Department of Computer Science and Application, Sharda University, Greater Noida, Uttar Pradesh, India

Joyline Bildad
Department of Computer Science and Application, Sharda University, Greater Noida, Uttar Pradesh, India

Annu Mishra
Department of Computer Science and Application, Sharda University, Greater Noida, Uttar Pradesh, India

Rajneesh Kumar Singh
Department of Computer Science and Application, Sharda University, Greater Noida, Uttar Pradesh, India

Abstract: The rapid growth of social media platforms such as Instagram has produced a proliferation of user-generated data, creating both opportunities and challenges for data-driven insights. Traditional data processing techniques struggle to handle such large-scale datasets effectively in terms of volume, velocity, and variety. To process and analyse Instagram engagement data at scale, this study uses the Apache Hadoop ecosystem to construct a distributed data analytics system.

The study uses a five-node Hadoop cluster with HDFS (Hadoop Distributed File System) for scalable storage and PySpark (Apache Spark for Python) for efficient data processing. It examines key engagement indicators such as likes, comments, post popularity, and comment sentiment, concentrating on data ingestion, preprocessing, feature extraction, and advanced analytics. The system's use of distributed computing allows both batch and real-time analysis of large datasets, enabling deeper insights into user interaction patterns and the most effective content strategies. The results show that Hadoop and PySpark provide a scalable and effective social media analytics solution, suitable for applications such as content performance optimisation, influencer marketing, and tailored recommendations. By demonstrating how open-source distributed computing frameworks can be used to extract valuable insights from enormous social media datasets, this study advances the field of big data analytics.

Keywords: Big Data Analytics, Apache Hadoop Ecosystem, PySpark, HDFS (Hadoop Distributed File System), Distributed Computing, Social Media Analytics, Instagram Data Analysis, Sentiment Analysis, Real-time Data Processing, Batch Processing.

  1. INTRODUCTION

    Social media sites like Instagram produce enormous volumes of user interaction data every day in the big data era. Likes, comments, shares, and follow activity are interactions that provide important information on audience behaviour, content popularity, and user engagement trends. However, analysing such vast amounts of data poses many obstacles in processing, storage, and real-time analytics. Because of their computing constraints and scalability issues, traditional data processing technologies struggle to address these challenges effectively. To tackle these problems, this study uses the Apache Hadoop ecosystem to create a distributed data processing system that relies on PySpark (Apache Spark for Python) for efficient data processing and HDFS (Hadoop Distributed File System) for storage. The objective is to analyse Instagram engagement metrics across a five-node Hadoop cluster to provide deep insights into post performance, user interaction patterns, and the most effective content tactics.

    To ensure scalable and effective management of massive datasets, this study focuses on data ingestion, preprocessing, feature extraction, and analytics using PySpark. With the goal of offering practical insights for social media strategy optimisation, it investigates key performance metrics such as likes per post, sentiment analysis of comments, and engagement trends.

    This study shows how well the Hadoop ecosystem works for social media analytics by putting in place a fully distributed environment that provides a scalable solution for real-time insights. This study makes a significant contribution to both academic and industrial applications since its conclusions can be used in personalised recommendation systems, influencer marketing, and content strategy optimisation.

  2. LITERATURE SURVEY

      1. Big Data Analytics in Social Media

        Traditional data processing techniques are insufficient due to the massive amount of user interaction data generated by social media sites such as Instagram. Big data analytics tools are becoming crucial for deriving valuable insights from unstructured social media data, allowing researchers and organisations to efficiently examine user behaviour and engagement trends (Gandomi and Haider, 2015)[1]. Scalable data processing in social media research has been made easier by the inclusion of distributed computing frameworks like Hadoop and Spark (Hashem et al., 2015)[2].

      2. The Apache Hadoop Ecosystem for Large-Scale Data Processing

        Because of its distributed architecture, scalability, and fault tolerance, Apache Hadoop has been widely used for big data processing and storage (White, 2012)[3]. Hadoop's HDFS (Hadoop Distributed File System) allows reliable storage of large datasets across numerous nodes, whereas MapReduce offers a programming model for parallel processing (Dean & Ghemawat, 2008)[4]. However, MapReduce's high latency makes it less appropriate for real-time data analysis.

      3. Apache Spark for Efficient Data Processing

        Apache Spark was presented as a quick, in-memory distributed computing platform to overcome MapReduce's performance issues (Zaharia et al., 2010)[5]. Spark is perfect for machine learning and large-scale data analytics because of its efficient iterative computation capabilities (Meng et al., 2016)[6]. Researchers have discovered that PySpark (Spark's Python API) improves data processing capabilities in the context of social media analytics, allowing for both batch and real-time monitoring of engagement metrics (Lamba & Garg, 2017)[7].

      4. Instagram Data Analysis and User Engagement

        Recent studies have examined Instagram data to find patterns in user engagement, sentiment, and influencer impact. Rahman et al. (2021)[8] investigated the impact of Instagram likes, comments, and hashtags on content exposure and engagement rates. Tifrea et al. (2020)[9] used machine learning techniques to categorise Instagram content according to user interactions. These studies emphasise how crucial scalable data analytics tools are to extracting useful information from Instagram data.

      5. Applications of Hadoop & Spark in Social Media Analytics

        Research shows that data from social media may be processed efficiently at scale using Hadoop and Spark. Tiwari et al. (2019)[10] showed how Hadoop can be used for sentiment analysis on Twitter and concluded that distributed computing is far more efficient than standard relational databases. In a similar vein, Dhyani & Patel (2021)[11] used Spark-based data processing to examine Facebook user interactions, achieving better scalability and quicker query execution times. These results confirm the efficacy of the Hadoop ecosystem for extensive social media analytics.

      6. Research Gap and Contribution

      Although big data analytics for social media has been studied in the past, most of these studies have focused on Twitter and Facebook, with little research on Instagram data analysis using Hadoop and Spark. By using a five-node Hadoop cluster to analyse Instagram engagement metrics, this study seeks to close that gap. It advances the fields of social media analytics, big data processing, and influencer marketing insights by utilising HDFS for scalable storage and PySpark for efficient processing.

  3. SYSTEM ARCHITECTURE AND COMPONENTS

          In order to create a scalable Instagram data analytics system, this research project makes use of the Apache Hadoop ecosystem. Data sources, distributed storage, big data processing frameworks, and visualisation tools are some of the components that make up the architecture. Every element is essential to the effective processing, storing, and analysis of massive amounts of Instagram engagement data.

          1. Data Source: Instagram Engagement Data

            • Instagram engagement metrics, such as likes, comments, hashtags, timestamps, and user interactions, make up the main dataset. Web scraping, third-party data extraction tools, and the Instagram API are some of the methods used to gather this data. Usually saved in CSV or JSON format, the dataset needs to be preprocessed before analysis can begin.
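As an illustration of the JSON form described above, a single raw engagement record might look like the following. The field names here are assumptions for illustration, not the actual dataset schema:

```python
import json

# Hypothetical raw engagement record; the real dataset's schema is not
# specified in this paper, so every field name below is an assumption.
raw = '''{
  "post_id": "p_001",
  "timestamp": "2025-01-15 18:30:00",
  "likes": 1250,
  "comments": 87,
  "hashtags": ["#travel", "#sunset"]
}'''

record = json.loads(raw)
print(record["likes"] + record["comments"])  # total raw engagement: 1337
```

Records of this shape are what the preprocessing stage later cleans and aggregates.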

          2. Hadoop Distributed File System (HDFS)

            • The main storage system for managing massive amounts of Instagram data is HDFS. It distributes data among several servers for scalability and fault tolerance, running on a five-node Hadoop cluster. By distributing data blocks among several nodes, HDFS reduces the possibility of data loss and guarantees data reliability.

          3. YARN (Yet Another Resource Negotiator)

            • As the Hadoop cluster's resource manager, YARN distributes computational resources dynamically. It is essential for maximising system performance since it ensures that data processing jobs are completed efficiently and distributes the workload evenly among cluster nodes.

          4. PySpark for Big Data Processing

      • The main processing engine for Instagram data analysis is Apache Spark, specifically its Python API (PySpark). PySpark enables in-memory processing, which greatly increases computational efficiency compared with the high latency of standard Hadoop MapReduce. PySpark handles a number of data analysis and transformation activities, including:

      • Data Ingestion: Loading raw Instagram data from HDFS.

      • Data Cleaning: Handling missing values, duplicate records, and timestamp conversion.

      • Feature Extraction: Calculating engagement metrics such as likes per post, comment frequency, peak activity hours, and sentiment scores.

      • Data Aggregation: Summarizing user interactions over time to identify trends.
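To make the cleaning step concrete, the sketch below reproduces its core logic (deduplication, missing-value handling, timestamp conversion) in plain standard-library Python. In the actual pipeline these operations run as PySpark DataFrame transformations, and the field names are illustrative assumptions:

```python
from datetime import datetime

def clean_records(records):
    """Drop duplicate posts, default missing likes/comments to 0,
    and convert timestamp strings into datetime objects."""
    seen, cleaned = set(), []
    for rec in records:
        pid = rec.get("post_id")
        if pid is None or pid in seen:  # skip records without an id, and duplicates
            continue
        seen.add(pid)
        out = dict(rec)
        out["likes"] = int(out.get("likes") or 0)
        out["comments"] = int(out.get("comments") or 0)
        out["timestamp"] = datetime.strptime(out["timestamp"], "%Y-%m-%d %H:%M:%S")
        cleaned.append(out)
    return cleaned

raw = [
    {"post_id": "a", "likes": 10, "comments": None, "timestamp": "2025-01-15 18:30:00"},
    {"post_id": "a", "likes": 10, "comments": None, "timestamp": "2025-01-15 18:30:00"},  # duplicate
    {"post_id": "b", "likes": None, "comments": 4, "timestamp": "2025-01-16 09:00:00"},
]
print(len(clean_records(raw)))  # 2
```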

          5. Spark SQL for Structured Query Processing

        • Structured data querying using SQL-like commands is made possible by the integration of Spark SQL into the system. It makes working with structured datasets easy and enables sophisticated analytical queries on Instagram engagement numbers. Additionally, Spark SQL improves data interchange by supporting many formats, including CSV, JSON, and Parquet.
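The kind of analytical query this enables can be shown with plain SQL. The example below runs a hashtag-level engagement aggregation against an in-memory SQLite table so it is self-contained; in the pipeline the same SQL would be issued through `spark.sql()` on a registered DataFrame, and the table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id TEXT, hashtag TEXT, likes INT, comments INT)")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", [
    ("a", "#travel", 120, 30),
    ("b", "#travel", 80, 20),
    ("c", "#food",   50, 10),
])

# Average engagement (likes + comments) per hashtag, highest first.
rows = conn.execute("""
    SELECT hashtag, AVG(likes + comments) AS avg_engagement
    FROM posts
    GROUP BY hashtag
    ORDER BY avg_engagement DESC
""").fetchall()
print(rows)  # [('#travel', 125.0), ('#food', 60.0)]
```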

          6. Resilient Distributed Datasets (RDDs) & DataFrames

        PySpark processes data using RDDs (Resilient Distributed Datasets) and DataFrames:

        • RDDs: Handle unstructured and semi-structured Instagram data, such as raw JSON files containing post details.

        • DataFrames: Manage structured engagement data, enabling faster query execution and advanced analytics.

          7. Feature Engineering and Sentiment Analysis

        The system extracts key engagement metrics and performs sentiment analysis on Instagram comments. Feature extraction includes:

        • Total Engagement Rate: (Likes + Comments) per post.

        • Peak Engagement Time: Identifying optimal posting hours.

        • Hashtag Effectiveness: Analyzing hashtag influence on post visibility.

        • Sentiment Analysis: Classifying comments as positive, neutral, or negative using Natural Language Processing (NLP) techniques.
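The engagement-rate metric and the sentiment step can be sketched as follows. The tiny word lists stand in for a real NLP approach (for example a trained classifier or a sentiment lexicon), so this is an illustration of the classification logic only, not the production technique:

```python
# Toy lexicons -- placeholders for a real sentiment model or lexicon.
POSITIVE = {"love", "great", "amazing", "beautiful"}
NEGATIVE = {"bad", "boring", "hate", "ugly"}

def engagement_rate(likes, comments, followers):
    """Total engagement (likes + comments) normalised by follower count."""
    return (likes + comments) / followers if followers else 0.0

def classify_comment(text):
    """Label a comment positive/negative/neutral by counting lexicon hits."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(engagement_rate(1200, 300, 50_000))            # 0.03
print(classify_comment("Love this, amazing shot!"))  # positive
```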

          8. Processed Data Storage and Visualization

        Once the data is processed, it is stored in multiple formats for further analysis and visualization:

        • HDFS: Retains the processed data for scalability and reusability.

        • External Databases (e.g., PostgreSQL, MongoDB):

          Used for structured query-based analysis.

        • Data Export (CSV/JSON): Enables integration with visualization tools like Tableau, Matplotlib, or Power BI.
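The CSV export step can be sketched with the standard library. The column names are illustrative, and in the pipeline the equivalent would be a PySpark `df.write.csv(...)` call writing to HDFS or local disk:

```python
import csv
import io

processed = [
    {"post_id": "a", "total_engagement": 150, "sentiment": "positive"},
    {"post_id": "b", "total_engagement": 60,  "sentiment": "neutral"},
]

# Write to an in-memory buffer here; a real run would write a file
# that Tableau, Matplotlib, or Power BI can then load.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["post_id", "total_engagement", "sentiment"])
writer.writeheader()
writer.writerows(processed)
print(buf.getvalue().splitlines()[0])  # post_id,total_engagement,sentiment
```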

          9. Data Visualization for Insights

        The extracted insights are visualized using charts, graphs, and reports to provide a clear understanding of Instagram engagement trends. This step helps identify:

        • Most engaging posts based on likes and comments.

        • Optimal posting times for maximum audience interaction.

        • Impact of hashtags on content visibility.

        • Sentiment trends in user comments.

          10. Cluster Management and Performance Optimization

        To ensure smooth execution, the system incorporates:

      • Resource Optimization: Dynamic allocation of CPU, memory, and disk usage across the cluster.

      • Parallel Processing: Distributing computation tasks across worker nodes for higher efficiency.

      • Fault Tolerance: Ensuring data availability by replicating blocks across HDFS nodes.

  4. ALGORITHM

      The following algorithm outlines the step-by-step process of data collection, preprocessing, analysis, and visualization for Instagram engagement data using the Apache Hadoop ecosystem.

      Algorithm Starts:

      Hadoop-Based Instagram Data Analytics

      Input: Raw Instagram dataset (likes, comments, hashtags, timestamps, follower counts)

      Output: Processed insights (engagement trends, sentiment analysis, best posting times)

      1. Initialize the Hadoop Ecosystem

        • Set up the HDFS cluster across five nodes.

        • Configure YARN for resource allocation.

      2. Data Collection & Storage

        • Extract Instagram engagement data via Instagram API or dataset imports.

        • Store the raw data in HDFS in JSON/CSV format.

      3. Data Preprocessing in PySpark

        • Load raw data from HDFS into PySpark.

        • Clean the dataset:

          • Handle missing values (remove or impute).

          • Convert timestamps into structured formats.

          • Normalize text (remove special characters from hashtags).

        • Store the cleaned data back into HDFS.

      4. Feature Extraction & Transformation

        • Compute engagement metrics:

          • Total Engagement = Likes + Comments per post.

          • Engagement Rate = (Likes + Comments) / Followers.

          • Peak Activity Time = Most active hours for user interactions.

        • Perform Sentiment Analysis on comments (optional).

      5. Data Storage for Querying

        • Store processed data in HDFS for further computation.

        • Export structured data to external databases (PostgreSQL, MongoDB) for visualization.

      6. Data Analysis & Visualization

        • Use PySpark SQL for querying processed data.

        • Generate visual reports using Matplotlib, Tableau, or Power BI.

        • Identify key trends, best posting times, and audience sentiment.

      7. Performance Optimization

        • Optimize YARN resource allocation for better parallel processing.

        • Ensure fault tolerance by replicating data across nodes.

      8. Output Insights & Decision Making

        • Generate final reports on:

      • Most engaging Instagram posts.

      • Best-performing hashtags.

      • Optimal posting schedule.

      • Sentiment trends from user comments.

      End of Algorithm
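Step 4's Peak Activity Time metric reduces to a group-by over posting hours. A minimal stand-alone sketch of that computation follows; the timestamp format and tie-breaking behaviour are assumptions, and in the pipeline this would be a PySpark `groupBy`/`count` over an hour column:

```python
from collections import Counter
from datetime import datetime

def peak_activity_hour(timestamps):
    """Return the hour of day (0-23) with the most interactions."""
    hours = Counter(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour
                    for ts in timestamps)
    return hours.most_common(1)[0][0]

interactions = [
    "2025-01-15 18:05:00",
    "2025-01-15 18:40:00",
    "2025-01-16 09:10:00",
    "2025-01-16 18:20:00",
]
print(peak_activity_hour(interactions))  # 18
```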

      FLOWCHART:

      START → DATA COLLECTION & STORAGE → DATA PREPROCESSING → FEATURE EXTRACTION → DATA STORAGE FOR QUERYING → DATA ANALYSIS & VISUALIZATION → PERFORMANCE OPTIMISATION → OUTPUT & DECISION MAKING → END

Fig. 1 Flowchart of Instagram Data Analytics using Hadoop & PySpark

Flowchart Explanation:

  1. Start: Initialising the Hadoop ecosystem is the first step in the process.

  2. Data Collection & Storage: Data about Instagram engagement is gathered and kept in HDFS.

  3. Data Preprocessing: PySpark is used for data transformation, cleansing, and organising.

  4. Feature Extraction: Key engagement metrics are computed and sentiment analysis is carried out.

  5. Data Storage for Querying: Both external databases and HDFS store processed data.

  6. Data Analysis & Visualization: Visualisation tools and queries are used to generate insights.

  7. Performance Optimization: Through the use of YARN and parallel processing, the system guarantees efficiency.

  8. Output & Decision Making: Trends and final reports are produced.

  5. HADOOP WORKING

      Roadmap for Hadoop-based Instagram Data Analytics:

      1. Planning the project and collecting requirements

      2. Setting up the environment and configuring the Hadoop cluster

      3. Gathering and storing Instagram data

      4. Preprocessing data and feature engineering

      5. Analysing and processing data

      6. Querying and storing the processed data

      7. Generating insights and visualisations

      8. Checking fault tolerance and performance optimisation

      9. Testing the system and deploying

      10. Feature enhancement and documentation

      fig. 2 Roadmap

  6. HADOOP ENVIRONMENT SETUP

      In order to implement a distributed data processing system for Instagram data analysis, it is essential to first set up a Hadoop environment. Hadoop distributes data across multiple nodes in a cluster, allowing for scalable, fault-tolerant, and parallel data processing. This section offers a step-by-step tutorial for configuring a five-node Hadoop cluster, with a MacBook serving as the NameNode (Master) and four physical machines serving as DataNodes (Workers).

      1. System Requirements:

        Before installing Hadoop, we made sure all the nodes in the cluster met the following hardware and software requirements:

        1. Hardware requirements:

          • Master node (in this case the MacBook – NameNode)

            • CPU: a multi-core processor.

            • RAM: 8GB or more.

            • Storage: SSD with at least 100GB of free space

          • Worker node (4 physical machines – DataNodes)

            • CPU: a multi-core processor.

            • RAM: 4GB or more.

            • Storage: at least 50GB per node

        2. Software requirements:

          • MacBook (Master Node): macOS with MacPorts installed

          • Worker Nodes: Any Unix-based OS (in this case macOS)

          • Java Development Kit (JDK 8 or higher)

          • SSH (Secure Shell) for password-less communication

          • Hadoop 3.x (latest stable version)

      2. Installation and Configuration:

      Step 1: Install Java (JDK 8 or Higher)

      • We installed Java on the master node using MacPorts, since Hadoop requires it:

        sudo port install openjdk11

        Verifying the installation:

        java -version

        Step 2: Installing Hadoop on the Master node (MacBook)

      • Using MacPorts, we installed Hadoop:

        sudo port install hadoop

        Verifying the installation:

        hadoop version

        Step 3: We configured SSH for password-less Authentication:

      • Hadoop requires secure password-free SSH access between the nodes, so we set up SSH keys on the master node and copied them to each worker:

      • ssh-keygen -t rsa

        ssh-copy-id username@datanode1
        ssh-copy-id username@datanode2
        ssh-copy-id username@datanode3
        ssh-copy-id username@datanode4

        Verifying the connectivity:

        ssh datanode1

        Step 4: We modified the following Hadoop configuration files:

        • hadoop-env.sh (Setting Java home)

          export JAVA_HOME=/Library/Java/JavaVirtualMachines/openjdk11/Contents/Home

        • core-site.xml (Properties of HDFS)

          <configuration>

          <property>

          <name>fs.defaultFS</name>

          <value>hdfs://master:9000</value>

          </property>

          </configuration>

        • hdfs-site.xml (Settings for storage replication)

          <configuration>

          <property>

          <name>dfs.replication</name>

          <value>3</value>

          </property>

          </configuration>

        • yarn-site.xml (Settings for resource manager)

          <configuration>

          <property>

          <name>yarn.resourcemanager.hostname</name>

          <value>master</value>

          </property>

          </configuration>

        • workers (worker nodes list; this file was named slaves before Hadoop 3.x)

        datanode1
        datanode2
        datanode3
        datanode4

        Step 5: Formatting the Hadoop Distributed File System (HDFS):

      • The NameNode must be formatted before starting Hadoop for the first time.

      hdfs namenode -format

      Step 6: Starting Hadoop Services:

      start-dfs.sh
      start-yarn.sh

      Verifying the running process:

      jps

      Step 7: Verifying the Hadoop cluster

      • We checked the cluster status from the master node:

      hdfs dfsadmin -report

      • The NameNode web UI (port 9870) and the ResourceManager web UI (port 8088) were also used to confirm that all four DataNodes were live.

      Vol. 14 Issue 03, March-2025

      FLOWCHART OF THE ENVIRONMENT SET-UP:

      START → Install Java (JDK 8 or higher) → Install Hadoop using MacPorts → Configure SSH for password-less authentication → Modify the Hadoop configuration files → Format HDFS → Start Hadoop services → Verify the Hadoop cluster → Hadoop set-up completed

      fig. 3 Hadoop environment setup

7. CONCLUSION

A strong big data analytics pipeline for Instagram data is ensured by the combination of HDFS for scalable storage, PySpark for high-performance computing, and visualisation tools for insight production. This architecture adds to the larger field of big data and influencer marketing analytics by offering a scalable, effective, and economical solution for social media engagement analysis.


8. ACKNOWLEDGMENTS

We would like to extend our deepest gratitude to our guide Dr. Ravi Prakash Chaturvedi for his valuable guidance.

9. REFERENCES

  1. Gandomi, A., & Haider, M. (2015). Beyond the Hype: Big Data Concepts, Methods, and Analytics. International Journal of Information Management, 35(2), 137-144.

  2. Hashem, I. A. T., et al. (2015). The Rise of "Big Data" on Cloud Computing: Review and Open Research Issues. Information Systems, 47, 98-115.

  3. White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media.

  4. Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107-113.

  5. Zaharia, M., et al. (2010). Spark: Cluster Computing with Working Sets. Proceedings of the USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10).

  6. Meng, X., et al. (2016). MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research, 17(1), 1235-1241.

  7. Lamba, A., & Garg, K. (2017). Real-Time Processing in Big Data Analytics Using Apache Spark. IEEE International Conference on Cloud Computing, 129-134.

  8. Rahman, M. S., et al. (2021). Data-Driven Analysis of Instagram User Engagement Trends. Journal of Social Media Analytics, 9(1), 56-72.

  9. Tifrea, C., et al. (2020). Classifying Instagram Content with Machine Learning Methods. Proceedings of the Social Media Data Science Workshop, 142-150.

  10. Tiwari, R., et al. (2019). Analysis of Twitter Sentiment with Hadoop and Spark. Journal of Big Data Research, 6(3), 198-210.

  11. Dhyani, V., & Patel, P. (2021). Big Data Analytics for Facebook User Data Processing with Apache Spark. International Journal of Data Science, 7(2), 89-102.