Scala Notes

I wrote more than 5000 lines of Scala code about a year ago, but since then I haven't had a chance to write any (my university doesn't teach it, and my previous internships didn't use it).

Now I'm picking it up again because the university teaches Spark in Scala, but I feel that I have forgotten a lot, not only the syntax but also some key knowledge about Scala.

Therefore I want to write some Scala notes.

Generic

1. side effect

#### 1. map on an iterator produces no side effects: https://stackoverflow.com/questions/12631778/scala-map-on-iterator-does-not-produce-side-effects

> List(1,2,3,4).iterator.map((x: Int) => println(x))

doesn't print

while

> List(1,2,3,4).map((x: Int) => println(x)) 

> List(1,2,3,4).foreach((x: Int) => println(x))

> List(1,2,3,4).iterator.foreach((x: Int) => println(x))

all do

Note:

  • map on an iterator is lazy; appending toList at the end forces evaluation (see the sketch below).
  • but foreach should be used for this instead, as it is designed for side effects (it is never lazy).
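
A minimal REPL-style sketch of the difference (values are illustrative):

    val it = List(1, 2, 3, 4).iterator.map((x: Int) => println(x))
    // nothing has been printed yet: the mapped iterator is lazy
    it.toList                                    // forces evaluation, prints 1 2 3 4

    // foreach is eager and is the idiomatic choice for side effects
    List(1, 2, 3, 4).iterator.foreach(println)   // prints 1 2 3 4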

2. ==, eq, and equals

  • equals in Scala is the same as equals in Java; calling it on a null reference throws a NullPointerException
  • == first checks for null, then delegates to equals
  • eq is the Scala equivalent of Java's == on references, i.e. reference equality

Examples
- 1 equals 2 will return false, as it redirects to Integer.equals(…)
- 1 == 2 will return false, as it redirects to Integer.equals(…)
- 1 eq 2 will not compile, as it requires both arguments to be of type AnyRef
- new ArrayList() equals new ArrayList() will return true, as it checks the content
- new ArrayList() == new ArrayList() will return true, as it redirects to equals(…)
- new ArrayList() eq new ArrayList() will return false, as both arguments are different instances
- foo equals foo will return true, unless foo is null, then will throw a NullPointerException
- foo == foo will return true, even if foo is null
- foo eq foo will return true, since both arguments link to the same reference
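
A quick REPL-style sketch of the null cases above (foo is just an illustrative variable):

    val foo: String = null
    foo == foo         // true: == handles null before delegating to equals
    foo eq foo         // true: both sides are the same (null) reference
    // foo equals foo  would throw a NullPointerException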

3. A very good command-line argument parsing library

import org.rogach.scallop._
https://github.com/scallop/scallop
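
A minimal usage sketch along the lines of the Scallop README (the option names here are just placeholders):

    import org.rogach.scallop._

    // declare the accepted options by extending ScallopConf
    class Conf(arguments: Seq[String]) extends ScallopConf(arguments) {
      val input = opt[String](required = true)   // --input <value>
      val verbose = opt[Boolean]()               // --verbose flag
      verify()
    }

    object Main extends App {
      val conf = new Conf(args)
      println(conf.input())                      // parsed values are accessed with ()
    }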

4. scala split

Java's split() only accepts a regular-expression string: split(regex: String)
Example: to split by the character '|', we must escape it: split("\\|")
split("|") means "split by (empty string | empty string)", a regex alternation of two empty patterns,
thus "abcde" will be split into (a, b, c, d, e)

Scala (via StringOps) overloads the split() function:
Split by character: split(separator: Char)
Split by regex: split(regex: String)
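
A short sketch comparing the overloads:

    val s = "a|b|c"

    s.split('|')         // Char overload: splits on the literal character -> Array(a, b, c)
    s.split("\\|")       // String overload: the argument is a regex, so '|' must be escaped
    "abcde".split("|")   // unescaped "|" is the regex "empty or empty",
                         // so it splits between every character -> Array(a, b, c, d, e)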

5. scala Either, to allow different return types

Note: Left and Right are wrappers that extend Either

    val a: Either[Int, String] = {
      if (true)
        Left(42)                // return an Int
      else
        Right("Hello, world")   // return a String
    }

Examples:

    val a: Either[org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]] = {
      if (text)
        Left(spark.sparkContext.textFile(input_path + "/lineitem.tbl"))   // read in text file as rdd
      else
        Right(sparkSession.read.parquet(input_path + "/lineitem").rdd)    // read in parquet file as df, convert to rdd
    }
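
To consume the value afterwards, pattern matching works on both sides; a small sketch using the val a defined above:

    a match {
      case Left(textRdd) => textRdd.count()   // RDD[String] from the .tbl file
      case Right(rowRdd) => rowRdd.count()    // RDD[Row] from the parquet file
    }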

6. Immutable map and mutable map

https://alvinalexander.com/scala/how-to-add-update-remove-elements-immutable-maps-scala
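
A few lines summarizing the linked article (assuming the standard Scala 2.x collections):

    val im = Map("a" -> 1)          // immutable by default (scala.collection.immutable.Map)
    val im2 = im + ("b" -> 2)       // adding returns a new map; im is unchanged
    val im3 = im2 - "a"             // removing also returns a new map

    import scala.collection.mutable
    val mm = mutable.Map("a" -> 1)
    mm("b") = 2                     // update in place
    mm += ("c" -> 3)                // add in place
    mm -= "a"                       // remove in place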

Spark

1. reduceByKey syntax

pairs.reduceByKey((accumulatedValue: Int, currentValue: Int) => accumulatedValue + currentValue)  

Note:

  • with explicit parameter types, the parameter list needs parentheses: reduceByKey((x: Type, y: Type) => …) (see the sketch below)
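
A minimal word-count-style sketch (assumes a SparkContext named sc, as in the REPL section below):

    val pairs = sc.parallelize(Seq("a", "b", "a")).map(word => (word, 1))
    val counts = pairs.reduceByKey((accumulatedValue: Int, currentValue: Int) => accumulatedValue + currentValue)
    counts.collect()   // Array((a,2), (b,1)), order not guaranteed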

2. read parquet files

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.getOrCreate

val lineitemDF = sparkSession.read.parquet("TPC-H-0.1-PARQUET/lineitem")

3. Scala REPL

Run spark-shell to get a scala> prompt.  
By default, a SparkSession is available as spark and a SparkContext as sc.

4. spark dataframe

1. return rows

head(n: Int)
returns the first n rows as Array[Row]; head() with no argument returns just the first Row

2. return Unit

show(n = 20), default n = 20
print n rows

show(truncate = true), default truncate = true
if truncate is true, strings longer than 20 characters are cut off in each printed cell; if false, no limit

describe():
computes a statistical summary for each numeric column (count, mean, stddev, min, max); it returns a DataFrame, so call .show() on the result to print it
Example: https://img-blog.csdn.net/20161012231742058

3. return dataframe

limit(n: Int)
returns a new DataFrame containing only the first n rows
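
A short sketch tying these together (uses the lineitemDF from the parquet section above):

    val firstRows = lineitemDF.head(5)        // Array[Row] with the first 5 rows
    lineitemDF.show(10)                       // prints 10 rows, long cells truncated
    lineitemDF.show(10, truncate = false)     // prints 10 rows without truncation
    lineitemDF.describe().show()              // statistical summary per numeric column
    val small = lineitemDF.limit(100)         // new DataFrame with at most 100 rows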

5. Secondary sorting

http://codingjunkie.net/spark-secondary-sort/
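
The linked post builds a composite key and relies on repartitionAndSortWithinPartitions; a hedged sketch of that idea (class and variable names are mine, not from the post; assumes sc exists):

    import org.apache.spark.Partitioner

    // Partition only by the natural key (first tuple element) so that all values
    // of one key land in the same partition, sorted there by the full composite key.
    class NaturalKeyPartitioner(partitions: Int) extends Partitioner {
      def numPartitions: Int = partitions
      def getPartition(key: Any): Int = key match {
        case (k: String, _) => math.abs(k.hashCode) % numPartitions
      }
    }

    val data = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))

    val secondarySorted = data
      .map { case (k, v) => ((k, v), ()) }                                // composite key (k, v)
      .repartitionAndSortWithinPartitions(new NaturalKeyPartitioner(2))   // sorts by (k, v) within each partition
      .keys                                                               // back to (k, v) pairs, grouped by k and sorted by v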

6. union 2 rdd and group by key

rdd1.cogroup(rdd2) returns RDD[(K, (Iterable[V], Iterable[W]))]: for each key in either RDD, the values from rdd1 and the values from rdd2
https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/rdd/PairRDDFunctions.html
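
A small sketch (assumes sc from the REPL section):

    val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val rdd2 = sc.parallelize(Seq(("a", "x"), ("c", "y")))

    val grouped = rdd1.cogroup(rdd2)   // RDD[(String, (Iterable[Int], Iterable[String]))]
    grouped.collect()
    // every key from either RDD appears once, e.g.:
    // (a, ([1], [x]))
    // (b, ([2], []))
    // (c, ([],  [y]))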

7. sortByKey with a customized ordering function

https://www.iteblog.com/archives/1240.html
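
The linked article (in Chinese) covers customizing the ordering; one standard way is to supply an implicit Ordering for the key type, which sortByKey resolves. A minimal sketch (assumes sc exists):

    // custom descending ordering for Int keys; sortByKey picks it up implicitly
    implicit val descendingInt: Ordering[Int] = Ordering.Int.reverse

    val sorted = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"))).sortByKey()
    sorted.collect()   // Array((3,c), (2,b), (1,a))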

Questions

Written with StackEdit.