java.lang.NegativeArraySizeException
Serialization trace:
SETTLED_TIME (_dw_0400_ld_bonus_XXX_fact_0_1.XXXXStruct)
otherElements (org.apache.spark.util.collection.CompactBuffer)
Solution
You are doing a combineByKey (so there is probably a join somewhere), and it spills to disk because the data is too big. Spilling to disk means serializing, and here the serialized blocks are > 2GB. Java array sizes are 32-bit ints, so a block bigger than 2GB overflows into a negative size, hence the NegativeArraySizeException.
Keep in mind that a 2GB dataset can easily expand to several TB once joined and combined.
Increase the parallelism, make sure your combineByKey has enough distinct keys, and see what happens. A sketch of the idea follows.
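A minimal sketch of what "increase parallelism" can look like: pass an explicit, larger partition count to combineByKey so each task handles a smaller slice and the spilled, serialized blocks stay well under 2GB. The RDD, path, and partition count below are hypothetical placeholders, not taken from the original job.

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object CombineByKeyParallelism {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combineByKey-parallelism"))

    // Hypothetical key/value pairs; in practice this is the RDD feeding your join.
    val pairs = sc.textFile("hdfs:///path/to/facts") // placeholder path
      .map(line => (line.split(',')(0), line))

    // More partitions => smaller per-task spills => serialized blocks stay below 2GB.
    // 2000 is an illustrative value, well above the default parallelism.
    val combined = pairs.combineByKey[List[String]](
      (v: String) => List(v),                          // createCombiner
      (acc: List[String], v: String) => v :: acc,      // mergeValue
      (a: List[String], b: List[String]) => a ::: b,   // mergeCombiners
      new HashPartitioner(2000)
    )

    combined.count()
    sc.stop()
  }
}
```

If the keys themselves are heavily skewed (too few distinct keys), adding partitions alone will not help, since one task still receives most of the data; salting the keys or pre-aggregating is the usual next step.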