Big Data Processing

Apache Spark Free Big Data Tool

In-Memory Data Processing With Free Big Data Tool

A flexible, lightweight, and fast unified analytics engine for large-scale data processing. It integrates with Hadoop and supports multiple languages.

Overview

Apache Spark is a free and open-source big data processing engine designed for fast computation. It extends the Hadoop MapReduce model so that more types of computation, such as interactive queries and stream processing, can be performed efficiently. It supports in-memory cluster computing, which boosts an application's processing speed. Apache Spark handles a variety of workloads, including iterative algorithms, interactive queries, and streaming, and it comes with out-of-the-box features such as fault tolerance, advanced analytics, lazy evaluation, real-time stream processing, in-memory data processing, and many more.

Over 80 high-level operators are available in Apache Spark for building parallel applications, and it also includes an API for real-time stream processing. All transformations in Apache Spark are lazy: instead of computing a result immediately, a transformation only creates a new RDD from the existing one, and the computation is deferred until an action requires it, which improves the system's performance. Apache Spark supports multiple languages such as Java, R, Scala, and Python, whereas Hadoop MapReduce jobs are written primarily in Java. In-memory processing of tasks gives Apache Spark a massive speed advantage. It works well with Hadoop's HDFS file system and with multiple file formats such as Parquet, JSON, CSV, and ORC, and Hadoop can easily be integrated with Apache Spark as either an input data source or a destination.
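As a minimal sketch of lazy evaluation in practice, consider the following PySpark session (this assumes Spark and Python are installed locally; the application name is a placeholder):

from pyspark.sql import SparkSession

# Start a local Spark session using all available cores.
spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

# Transformations are lazy: filter() and map() only record new RDDs
# in the lineage graph; nothing is computed yet.
rdd = spark.sparkContext.parallelize(range(1, 1001))
evens = rdd.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# An action such as count() triggers the actual computation.
print(squares.count())  # 500

spark.stop()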

System Requirements

To install Apache Spark, you must have the following software installed (you can verify both with the commands shown after this list):

  • Java
  • Scala
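
Both should already be available on your PATH. A quick way to check (the exact version output will vary; Spark 3.1 runs on Java 8 or 11):

$ java -version
$ scala -version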

Features

Following are the key features of Apache Spark:

  • Free and open source
  • Fast processing speed
  • Flexible and easy to use
  • Real-time stream processing
  • Reusability
  • Fault tolerance
  • Supports multiple languages
  • Integrated with Hadoop
  • Cost-efficient
  • Advanced analytics
  • In-memory computing (see the sketch below)
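
As a brief illustration of the in-memory computing feature, the following PySpark sketch caches a synthetic dataset so that repeated actions reuse data held in executor memory (assuming a local Spark installation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

# A synthetic DataFrame of ten million rows.
df = spark.range(10_000_000)

# cache() marks the DataFrame to be kept in memory once it is first computed.
df.cache()

# The first action materializes the cache; later actions reuse the
# in-memory data instead of recomputing it from scratch.
df.count()
df.filter("id % 2 = 0").count()

spark.stop()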

Installation

Install Apache Spark on Ubuntu 18.04

Execute the following command to download Apache Spark.

$ wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
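
Optionally, verify the download against the SHA-512 checksum published alongside the archive on archive.apache.org (compare the output by hand with the contents of the corresponding .sha512 file):

$ sha512sum spark-3.1.1-bin-hadoop3.2.tgz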

Extract the tar archive using the command below.

$ tar -zxf spark-3.1.1-bin-hadoop3.2.tgz

Move the extracted directory to /opt/spark.

$ sudo mv spark-3.1.1-bin-hadoop3.2 /opt/spark

Open the ~/.bashrc file and add the lines below to it.

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply the changes to your current shell with the following command.

$ source ~/.bashrc
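
You can confirm that the Spark binaries are now on your PATH:

$ spark-submit --version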

Start the Spark master server.

$ start-master.sh
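
The master does not run applications by itself; you can attach a worker to it with the companion script (in Spark 3.1 the script is named start-worker.sh; replace server-ip with your master's address, and note that 7077 is the default master port):

$ start-worker.sh spark://server-ip:7077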

Open a browser and navigate to http://server-ip:8080 to access the Spark master's web interface (replace server-ip with your server's IP address).
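
As a quick smoke test, run one of the example programs bundled with the distribution (this uses Spark's local mode):

$ run-example SparkPi 10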
