(1)

Part II

(2)

• The method cube(col1, .., coln) of the DataFrame class can be used to create a multi-dimensional cube for the input DataFrame
  ▪ On top of which aggregate functions can be computed for each "group"

• The method rollup(col1, .., coln) of the DataFrame class can be used to create a multi-dimensional rollup for the input DataFrame
  ▪ On top of which aggregate functions can be computed for each "group"

(3)

• Specify which attributes are used to split the input data into "groups" by using cube(col1, .., coln) or rollup(col1, .., coln), respectively

• Then, apply the aggregate functions you want to compute for each group of the cube/rollup (see the sketch after this list)

• The result is a DataFrame
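
A minimal sketch of this two-step pattern (df, col1, col2, and value are hypothetical names used only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with hypothetical column names col1, col2, value
df = spark.createDataFrame([("a", "x", 1), ("a", "y", 2), ("b", "x", 3)],
                           ["col1", "col2", "value"])

# Step 1: split the input data into groups with cube (or rollup)
# Step 2: apply the aggregate functions to compute for each group
dfCubeSketch = df.cube("col1", "col2").agg({"value": "sum"})

# The result is a regular DataFrame
dfCubeSketch.show()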

(4)

• The same aggregate functions/methods we already discussed for groupBy can also be used for cube and rollup
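
For instance, several aggregates can be computed per group with a single agg() call, exactly as with groupBy. A sketch, assuming the dfPurchases DataFrame created in the example later in this section:

from pyspark.sql import functions as F

# Multiple aggregates per cube group, as with groupBy
# (sketch; dfPurchases is the DataFrame created in the example below)
dfStats = dfPurchases.cube("userid", "productid")\
    .agg(F.sum("quantity"),
         F.avg("quantity"),
         F.count("*"))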

(5)

Create a DataFrame from the purchases.csv file

The first line contains the header

The other lines contain the quantities of products purchased by users

▪ Each line contains userid,productid,quantity

Create a first DataFrame containing the result of the cube method. Define one group for each pair (userid, productid) and compute the sum of quantity in each group

Create a second DataFrame containing the result of the rollup method. Define one group for each pair (userid, productid) and compute the sum of quantity in each group

(6)

• Input file

userid,productid,quantity
u1,p1,10
u1,p1,20
u1,p2,20
u1,p3,10
u2,p1,20
u2,p3,40
u2,p3,30

(7)

• Expected output - cube

userid  productid  sum(quantity)
null    null       150
null    p1         50
null    p2         20
null    p3         80
u1      null       60
u1      p1         30
u1      p2         20
u1      p3         10
u2      null       90
u2      p1         20
u2      p3         70

(8)

• Expected output - rollup

userid  productid  sum(quantity)
null    null       150
u1      null       60
u1      p1         30
u1      p2         20
u1      p3         10
u2      null       90
u2      p1         20
u2      p3         70

(9)

from pyspark.sql import SparkSession

# Create a Spark Session object
spark = SparkSession.builder.getOrCreate()

# Read purchases.csv and store it in a DataFrame
dfPurchases = spark.read.load("purchases.csv",\
                              format="csv",\
                              header=True,\
                              inferSchema=True)

# One group for each pair (userid, productid); sum of quantity in each group
dfCube = dfPurchases\
    .cube("userid", "productid")\
    .agg({"quantity": "sum"})

dfRollup = dfPurchases\
    .rollup("userid", "productid")\
    .agg({"quantity": "sum"})
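
To inspect the two results and compare them with the expected outputs shown on the previous slides, show() can be invoked on the DataFrames:

# Print the content of the two DataFrames on the standard output
dfCube.show()
dfRollup.show()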

(10)

• Similarly to RDDs, DataFrames can also be combined by using set transformations, as sketched below

  ▪ df1.union(df2)

  ▪ df1.intersect(df2)

  ▪ df1.subtract(df2)
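
A minimal sketch, reusing the spark session created in the example above; the two toy DataFrames and their data are hypothetical:

# Two toy DataFrames with the same single-column schema
df1 = spark.createDataFrame([(1,), (2,)], ["id"])
df2 = spark.createDataFrame([(2,), (3,)], ["id"])

dfUnion = df1.union(df2)          # all rows of df1 followed by all rows of df2 (duplicates kept)
dfIntersect = df1.intersect(df2)  # rows that occur in both DataFrames
dfSubtract = df1.subtract(df2)    # rows of df1 that do not occur in df2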

(11)

• The method explain() can be invoked on a DataFrame to print on the standard output the execution plan of the part of the code that is used to compute the content of the DataFrame on which explain() is invoked
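
For example, invoking it on the dfCube DataFrame defined earlier:

# Print the execution plan used to compute dfCube
dfCube.explain()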

(12)

• Spark SQL automatically implements a broadcast version of the join operation if one of the two input DataFrames is small enough to be stored in the main memory of each executor
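
The size threshold below which a DataFrame is considered small enough is controlled by the spark.sql.autoBroadcastJoinThreshold configuration parameter (its default value is 10 MB). A sketch:

# Raise the automatic broadcast threshold to 100 MB (setting it to -1 disables it)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)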

(13)

• We can suggest/force it by creating a broadcast version of a DataFrame

• E.g.,

from pyspark.sql.functions import broadcast

dfPersonLikesBroadcast = dfUidSports\
    .join(broadcast(dfPersons),\
          dfPersons.uid == dfUidSports.uid)


In this case we specify that dfPersons must be broadcast; hence, Spark will execute the join operation by using a broadcast join
