In the method blockdedup, we have the following code:
val distinctPartitions = blocked.select(partColumn).distinct().count()
val hashPart = new HashPartitioner(distinctPartitions.toInt)
val blockedRDD = blocked.rdd
  .keyBy(x => x.getString(x.fieldIndex(partColumn)))
  .partitionBy(hashPart)
If I understand correctly, calling .count() will evaluate the DataFrame. Wouldn't it be beneficial to persist the DataFrame, do the count, and then do the keyBy, so that the data is not computed twice?
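To make the suggestion concrete, here is a rough sketch of what I have in mind. It is not the project's actual code: the SparkSession setup, the sample rows, the "block" column name, and the choice of StorageLevel.MEMORY_AND_DISK are all assumptions made up for illustration. Only the count / keyBy / partitionBy part mirrors the snippet above.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistBeforeCountSketch {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration (assumption, not the project's setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("persist-before-count-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the real `blocked` DataFrame and `partColumn`.
    val partColumn = "block"
    val blocked = Seq(("a", 1), ("a", 2), ("b", 3)).toDF(partColumn, "value")

    // Mark the DataFrame for caching so the count below and the later
    // RDD conversion can reuse the same materialized data.
    blocked.persist(StorageLevel.MEMORY_AND_DISK)

    // First action: evaluates `blocked` and fills the cache.
    val distinctPartitions = blocked.select(partColumn).distinct().count()

    val hashPart = new HashPartitioner(distinctPartitions.toInt)
    val blockedRDD = blocked.rdd
      .keyBy(x => x.getString(x.fieldIndex(partColumn)))
      .partitionBy(hashPart)

    // Force the shuffle so the cached data is actually read before unpersisting.
    blockedRDD.count()
    println(blockedRDD.getNumPartitions)

    // Release the cache only after the downstream work has run.
    blocked.unpersist()
    spark.stop()
  }
}
```

Whether this actually helps would depend on how expensive it is to compute blocked and whether it fits in memory, so I may well be missing a reason it was left out.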
Also, why can't we just pass the field name to partitionBy?