Install OpenJDK 8
To install OpenJDK 8 on Debian 10, follow the steps described at https://linuxize.com/post/install-java-on-debian-10/
PySpark does not work well with newer versions of Java, so stick with OpenJDK 8 or Oracle Java 8.
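For a quick sketch of what that guide does: Debian 10's default repositories ship OpenJDK 11, so version 8 typically comes from the AdoptOpenJDK repository. The commands below are an outline of that route (the repository URL and package name come from AdoptOpenJDK's own instructions and may change over time); treat the linked article as the authoritative reference.
sudo apt update
sudo apt install apt-transport-https ca-certificates wget dirmngr gnupg software-properties-common
wget -qO - https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public | sudo apt-key add -
sudo add-apt-repository --yes https://adoptopenjdk.jfrog.io/adoptopenjdk/deb/
sudo apt update
sudo apt install adoptopenjdk-8-hotspot
java -version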
Install PySpark
Next, install PySpark using pip. I prefer Python 3, so the installation uses pip3:
pip3 install pyspark
Collecting pyspark
Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
100% |████████████████████████████████| 215.7MB 4.0MB/s
Requirement already satisfied: py4j==0.10.7 in /usr/local/lib/python3.7/dist-packages (from pyspark) (0.10.7)
Installing collected packages: pyspark
Running setup.py install for pyspark ... done
Successfully installed pyspark-2.4.4
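A quick check that the package is importable and reports the expected version:
python3 -c "import pyspark; print(pyspark.__version__)"
2.4.4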
PySpark bundles the required Hadoop components:
pip3 show -f pyspark | grep hadoop
pyspark/jars/avro-mapred-1.8.2-hadoop2.jar
pyspark/jars/hadoop-annotations-2.7.3.jar
pyspark/jars/hadoop-auth-2.7.3.jar
pyspark/jars/hadoop-client-2.7.3.jar
pyspark/jars/hadoop-common-2.7.3.jar
pyspark/jars/hadoop-hdfs-2.7.3.jar
pyspark/jars/hadoop-mapreduce-client-app-2.7.3.jar
pyspark/jars/hadoop-mapreduce-client-common-2.7.3.jar
pyspark/jars/hadoop-mapreduce-client-core-2.7.3.jar
pyspark/jars/hadoop-mapreduce-client-jobclient-2.7.3.jar
pyspark/jars/hadoop-mapreduce-client-shuffle-2.7.3.jar
pyspark/jars/hadoop-yarn-api-2.7.3.jar
pyspark/jars/hadoop-yarn-client-2.7.3.jar
pyspark/jars/hadoop-yarn-common-2.7.3.jar
pyspark/jars/hadoop-yarn-server-common-2.7.3.jar
pyspark/jars/hadoop-yarn-server-web-proxy-2.7.3.jar
pyspark/jars/parquet-hadoop-1.10.1.jar
pyspark/jars/parquet-hadoop-bundle-1.6.0.jar
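You can also locate these jars from Python itself. A minimal sketch, assuming only the pip install above:
# Print the Hadoop jars bundled inside the installed pyspark package.
import os
import pyspark

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
for jar in sorted(os.listdir(jars_dir)):
    if "hadoop" in jar:
        print(jar)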
Basic environment config
Set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables in /etc/profile:
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PYSPARK_PYTHON=/usr/bin/python3
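Reload the profile (or log out and back in) so the variables take effect, then confirm they are visible:
source /etc/profile
echo $PYSPARK_PYTHON $PYSPARK_DRIVER_PYTHON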
Verify Installation
manish@scorpio:~$ python3
Python 3.7.3 (default, Apr 3 2019, 05:39:12)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> from pyspark import SparkContext
>>> sc =SparkContext()
19/12/06 11:07:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>>
You can ignore the warning messages above.
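Beyond the import, a quick smoke test in the same session confirms the context can actually run a job (the sum of 0 through 99 is 4950):
>>> sc.parallelize(range(100)).sum()
4950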
Let us read a CSV file
Here is our sample CSV, saved as sample.csv:
City,State,Temp,Humidity
Gurgaon,HR,22,89
Delhi,DL,25,78
Blr,KA,20,90
Mum,MH,40,99
Jaipur,RJ,45,47
Kochi,KL,30,55
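If you would rather create the file from Python than an editor, a one-off snippet like this works:
# Write the sample rows to sample.csv in the current directory.
with open("sample.csv", "w") as f:
    f.write(
        "City,State,Temp,Humidity\n"
        "Gurgaon,HR,22,89\n"
        "Delhi,DL,25,78\n"
        "Blr,KA,20,90\n"
        "Mum,MH,40,99\n"
        "Jaipur,RJ,45,47\n"
        "Kochi,KL,30,55\n"
    )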
Python code to read data from the above CSV:
#!/usr/bin/python3
import pyspark
from pyspark.sql import SQLContext

# Create a SparkContext and wrap it in an SQLContext for the DataFrame API.
sc = pyspark.SparkContext()
sql = SQLContext(sc)

# "csv" is the built-in CSV data source in Spark 2.x; the older
# "com.databricks.spark.csv" name still works as an alias for it.
df = (sql.read
      .format("csv")
      .option("header", "true")  # treat the first line as column names
      .load("sample.csv"))
df.show()
Upon executing it, you should see the contents of the file:
+-------+-----+----+--------+
| City|State|Temp|Humidity|
+-------+-----+----+--------+
|Gurgaon| HR| 22| 89|
| Delhi| DL| 25| 78|
| Blr| KA| 20| 90|
| Mum| MH| 40| 99|
| Jaipur| RJ| 45| 47|
| Kochi| KL| 30| 55|
+-------+-----+----+--------+
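As a next step, you can let Spark infer column types so numeric comparisons work; inferSchema is a standard option of Spark's CSV reader. A short sketch building on the same sc and sql objects:
# Re-read with schema inference so Temp and Humidity become integers,
# then keep only the rows with Temp above 30.
df2 = (sql.read
       .format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load("sample.csv"))
df2.filter(df2.Temp > 30).show()
This should print just the Mum and Jaipur rows.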