If you want to understand the Adaptive Query Execution (AQE) framework, please check this section.
Shuffle Optimizations
As you know, shuffling is one of the most expensive operations in any query that involves it. If you can find the correct number of shuffle partitions, that is great, but doing so has always been challenging. It is challenging because the amount of data varies from query to query and from stage to stage (because you apply filters, joins, etc. in the query). If you use the same number of shuffle partitions for every stage, you end up either with many small tasks, which is inefficient for the Spark scheduler, or with a smaller number of big tasks, which causes excessive garbage collection overhead and disk spilling.
Here the Adaptive Query Execution framework helps a lot: AQE automatically adjusts the number of shuffle partitions at each stage of the query, based on the size of the map-side shuffle output. Hence, even as data sizes grow or shrink from stage to stage, the task size remains roughly the same, neither too big nor too small.
Does the AQE framework set the map-side number of partitions? The answer is no. Then how does it work? For that, you have to do either of the following for AQE to work well (see the sketch after this list):
High number of shuffle partitions: The user should set a high number of shuffle partitions using the SQL config spark.sql.shuffle.partitions.
Auto optimized shuffle: If you are using Databricks, you can enable the Databricks feature "Auto Optimized Shuffle" by setting spark.databricks.adaptive.autoOptimizeShuffle.enabled to true.
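As a rough illustration of these two options, here is a minimal PySpark sketch (the app name and the 2000-partition starting value are just assumptions for the example; spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are the standard open-source AQE settings that go with the first option):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-shuffle-demo")                      # hypothetical app name
    # Option 1: enable AQE and start from a deliberately high partition count;
    # AQE then coalesces the shuffle partitions per stage at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.shuffle.partitions", "2000")   # intentionally high starting point
    # Option 2 (Databricks only, instead of tuning the number yourself):
    .config("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
    .getOrCreate()
)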
Selecting the correct join strategy
When the cost-based optimizer is used, a very important decision made for a join is which join strategy to use. The selection of the join strategy depends on the size estimation of the join relations, and there is a chance that this estimation goes wrong, as below:
Overestimation: in this case a less efficient join strategy would be selected.
Underestimation: this can lead to an out-of-memory error.
To avoid that, the AQE framework helps by switching to the faster broadcast hash join at execution time, as shown in the sketch below.
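Here is a hedged sketch of how this looks in practice (the table names orders and customers, the column names, and the filter are assumptions made up for the example; only spark.sql.adaptive.enabled and the runtime switch to broadcast hash join come from Spark's documented AQE behavior):

spark.conf.set("spark.sql.adaptive.enabled", "true")

orders = spark.table("orders")              # assumed large fact table
customers = spark.table("customers")        # assumed dimension table

# The static plan may choose a sort merge join based on the pre-filter size
# estimate of "customers"; at runtime AQE re-plans with the actual, much
# smaller filtered size and can switch to a broadcast hash join.
small_side = customers.filter("country = 'NL'")
joined = orders.join(small_side, "customer_id")

joined.explain()    # once executed, the final adaptive plan can show BroadcastHashJoin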
Taking care of skew joins
When data is unevenly distributed across partitions, this can lead to a data skew problem. Because of data skew we can face performance issues in sort merge joins: some tasks run much longer than others and slow down the entire stage, and there is a chance that data spills from memory to disk, slowing those tasks down even further.
During runtime optimization, the optimizer can detect skewness in the partitions from the updated statistics and split the skewed partitions into smaller ones, which helps improve performance.
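The skew-join handling can be tuned with the AQE settings below; a minimal sketch, assuming Spark 3.x (the factor and byte threshold shown are the usual defaults, given here only for illustration, so adjust them for your own data):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is treated as skewed when it is both "skewedPartitionFactor" times
# larger than the median partition size and bigger than the byte threshold below.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")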