Apache Spark Configuration Experiences

Some important spark configuration

--conf spark.driver.memory=90g \
--conf spark.sql.files.maxPartitionBytes=4194304 \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=/opt/airflow/dags/spark-events

Partition configuration
- Input partition
  - control the input partition by using spark.sql.files.maxPartitionBytes
    - default is 128mb, but if you will have explode operation, consider to decrease this value
  - control the shuffle partition by using spark.sql.shuffle.partitions
    - default is 200 partition, but if you will have explode operation, consider to decrease this value
  - control the output partition by using .option("maxRecordsPerFile", "x")

Estimate the number of partition

import numpy as np
 
 
print ("\n\n*********************************************************************************")
 
"""
IF THE SHUFLLE PARTITION IS NOT CHANGE IN THE SPARK UI, PLEASE CHECK THE FOLLOWING:
- REMOVE THE TABLE IN MINIO UNDER THE SAME NAME OF THE RUNNING SPARK APP
"""
CONS_SHUFFLE_RATIO_ROWS = 1/21
CONST_MB_TO_BYTES = 1024 * 1024
 
total_num_rows = 5932823 <- you only need this as input
 
seed_no = 123
num_cores = int(32)
total_available_memory_gb= 125
reserved_memory_gb = 35
 
target_input_max_bytes_per_file_mb = 4
target_input_max_bytes_per_file_bytes = target_input_max_bytes_per_file_mb * CONST_MB_TO_BYTES
 
max_records_per_file = 1_000
 
num_shuffle_partitions = int(np.ceil(total_num_rows * CONS_SHUFFLE_RATIO_ROWS))
 
print(f"{'SEED NO':<30} : {seed_no}")
print(f"{'NUM CORES':<30} : {num_cores}")
print(f"{'TOTAL MEMORY DRIVER':<30} : {total_available_memory_gb-reserved_memory_gb} gb")
print (f"{'MEMORY PER CORE':<30} : {(total_available_memory_gb - reserved_memory_gb)/num_cores} gb")
 
print(f"{'MAX INPUT SIZE':<30} : {target_input_max_bytes_per_file_mb} mb / {target_input_max_bytes_per_file_bytes} bytes")
print (f"{'SHUFFLE PARTITION':<30} : {num_shuffle_partitions} partitions")
print (f"{'MAX RECORDS PER FILE':<30} : {max_records_per_file} records")
print ("*********************************************************************************\n\n")

Explorer

Recent Posts

Apache Spark Configuration Experiences

Graph View

Backlinks