This post describes Pinterest's Auto Memory Retries feature for Apache Spark, which automatically retries tasks that failed with out-of-memory (OOM) errors on larger executors to reduce job failures and resource waste.
- Over 4.6% of Pinterest's 90k+ daily Spark jobs fail due to OOM errors, consuming significant compute and creating on-call burden
- The hybrid retry strategy first doubles CPU-per-task (letting the task monopolize executor memory), then launches physically larger executors (2x, 3x, 4x profiles) if the OOM persists
- Core Spark classes (Task, TaskSetManager, TaskSchedulerImpl, ExecutorAllocationManager) were extended, rather than using a Spark listener approach, for finer scheduling control
- Off-heap memory (used with Apache Gluten/Velox) is also doubled during retries; the Spark UI was updated to display task resource profile IDs
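
The escalation order in the hybrid strategy can be sketched in plain Python. This is only an illustration of the sequence the bullets describe, not Pinterest's implementation: `TaskResources`, `resources_for_oom_retry`, and the base sizes are hypothetical names and values, capping CPUs at the executor's core count is an assumption, and so is applying the off-heap doubling only on the larger-executor steps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskResources:
    """Hypothetical container for what one task attempt is scheduled with."""
    cpus_per_task: int
    executor_memory_mb: int
    executor_offheap_mb: int


# Executor size multipliers tried after the CPU-doubling step (2x, 3x, 4x).
EXECUTOR_SCALE_STEPS = (2, 3, 4)


def resources_for_oom_retry(
    base: TaskResources, oom_failures: int, executor_cores: int
) -> Optional[TaskResources]:
    """Return the resources for the next attempt, or None when no step is left."""
    if oom_failures == 0:
        # First attempt: run with the job's configured resources.
        return base
    if oom_failures == 1:
        # Step 1: double CPUs per task so fewer tasks share the executor.
        # Assumption: the request is capped at the executor's core count.
        cpus = min(base.cpus_per_task * 2, executor_cores)
        return TaskResources(cpus, base.executor_memory_mb, base.executor_offheap_mb)
    step = oom_failures - 2
    if step >= len(EXECUTOR_SCALE_STEPS):
        # All profiles exhausted: let the task fail normally.
        return None
    scale = EXECUTOR_SCALE_STEPS[step]
    # Step 2+: request a physically larger executor profile.
    # Assumption: off-heap memory is doubled (not scaled) on these steps.
    return TaskResources(
        base.cpus_per_task,
        base.executor_memory_mb * scale,
        base.executor_offheap_mb * 2,
    )
```

Calling `resources_for_oom_retry(TaskResources(1, 8192, 2048), oom_failures, executor_cores=4)` with `oom_failures` from 0 upward walks through the base attempt, the CPU-doubling attempt, and then the 2x, 3x, and 4x executor profiles before giving up.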
This summary was automatically generated by AI based on the original article and may not be fully accurate.