Question about blockdedup and call to count()

In the method [blockdedup](https://github.com/zouzias/spark-lucenerdd/blob/master/src/main/scala/org/zouzias/spark/lucenerdd/LuceneRDD.scala#L574), we have the following code:

```
val distinctPartitions = blocked.select(partColumn).distinct().count()
val hashPart = new HashPartitioner(distinctPartitions.toInt)

val blockedRDD = blocked.rdd
  .keyBy(x => x.getString(x.fieldIndex(partColumn)))
  .partitionBy(hashPart)
```
If I understand correctly, calling `.count()` triggers an evaluation of the DataFrame, and the later `keyBy`/`partitionBy` step evaluates it a second time. Wouldn't it be beneficial to persist the DataFrame first, do the count, and then do the `keyBy`?
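
To make the idea concrete, here is a rough sketch of what I have in mind. The helper name `partitionByBlock` and the `MEMORY_AND_DISK` storage level are my own assumptions and not code from the library; `blocked` and `partColumn` are the same values as in the snippet above.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, only to illustrate the question.
def partitionByBlock(blocked: DataFrame, partColumn: String): RDD[(String, Row)] = {
  // Mark the DataFrame for caching so the count() below materializes it once
  // and the keyBy/partitionBy step can reuse the cached data.
  val cached = blocked.persist(StorageLevel.MEMORY_AND_DISK)

  // First action: computes the number of distinct blocks and fills the cache.
  val distinctPartitions = cached.select(partColumn).distinct().count()

  // Guard against an empty input (my addition, not in the original code).
  val hashPart = new HashPartitioner(math.max(1, distinctPartitions.toInt))

  // Second use of the same data: should read from the cache where possible.
  cached.rdd
    .keyBy(row => row.getString(row.fieldIndex(partColumn)))
    .partitionBy(hashPart)
}
```

The caller would still need to call `unpersist()` on the DataFrame once the partitioned RDD has been consumed, so the trade-off is extra memory or disk usage against recomputing `blocked`.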

Also, why can't we just partition by the field name directly, instead of counting the distinct values and building a `HashPartitioner`?
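
For example, I was thinking of something along these lines, using the DataFrame-level `repartition` on the column. This is only a sketch of the alternative I am asking about; the helper name `repartitionByBlock` is hypothetical, and I have not checked whether the rest of `blockDedup` relies on having one partition per distinct block value.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, only to illustrate the question.
def repartitionByBlock(blocked: DataFrame, partColumn: String): DataFrame = {
  // Hash-partitions rows by the column value; no count() is needed because
  // the number of partitions defaults to spark.sql.shuffle.partitions.
  blocked.repartition(col(partColumn))
}
```

With this approach, rows sharing a block value still land in the same partition, but several block values may share one partition, which differs from sizing the `HashPartitioner` by the distinct count.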
