Data Engineering

Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest

2026-02-17

16 min read

by Pinterest Engineering

Tags: engineering, data, apache-spark, open-source

Endigest AI Core Summary

This post describes Pinterest's Auto Memory Retries feature for Apache Spark, which automatically retries tasks that failed with out-of-memory (OOM) errors on larger executors, reducing both job failures and wasted resources.

• Over 4.6% of Pinterest's 90k+ daily Spark jobs fail due to OOM errors, consuming significant compute and creating on-call burden
• The hybrid retry strategy first doubles CPU-per-task (letting the task monopolize executor memory), then launches physically larger executors (2x, 3x, 4x profiles) if OOM persists
• Core Spark classes (Task, TaskSetManager, TaskSchedulerImpl, ExecutorAllocationManager) were extended rather than using a Spark listener approach for finer scheduling control
• Off-heap memory (used with Apache Gluten/Velox) is also doubled during retries; Spark UI was updated to display task resource profile IDs
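
The hybrid retry strategy described in the bullets can be pictured as a small escalation ladder. The following is a minimal Python sketch of that idea, not Pinterest's implementation and not Spark's API: the `Profile` class, the `next_profile` function, and the example sizes (4 cores, 8 GiB heap, 2 GiB off-heap) are all hypothetical. It also assumes that "2x, 3x, 4x" refers to multiples of the base executor's heap, that the doubled CPU-per-task setting is kept on the larger executors, and that off-heap memory is doubled once relative to the base for any retry.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the larger executor profiles scale the base heap by these factors.
EXECUTOR_MULTIPLIERS = (2, 3, 4)


@dataclass(frozen=True)
class Profile:
    """Illustrative stand-in for a resource profile (hypothetical, not Spark's class)."""
    executor_cores: int
    cpus_per_task: int
    heap_gib: int
    off_heap_gib: int

    @property
    def concurrent_tasks(self) -> int:
        # Number of tasks that can run at the same time on one executor.
        return max(1, self.executor_cores // self.cpus_per_task)

    @property
    def heap_per_task_gib(self) -> float:
        # Rough per-task heap share if concurrent tasks split the heap evenly.
        return self.heap_gib / self.concurrent_tasks


def next_profile(base: Profile, oom_failures: int) -> Optional[Profile]:
    """Pick the profile for a task that has hit `oom_failures` OOM errors.

    Returns None once every step of the ladder has been tried.
    """
    if oom_failures <= 0:
        return base
    # Step 1: double CPUs per task so fewer tasks share the same executor.
    doubled_cpus = min(base.executor_cores, base.cpus_per_task * 2)
    # Assumption: off-heap memory is doubled once for any retry.
    off_heap = base.off_heap_gib * 2
    if oom_failures == 1:
        return Profile(base.executor_cores, doubled_cpus, base.heap_gib, off_heap)
    # Later steps: move to a physically larger executor (2x, 3x, 4x heap).
    index = oom_failures - 2
    if index >= len(EXECUTOR_MULTIPLIERS):
        return None
    heap = base.heap_gib * EXECUTOR_MULTIPLIERS[index]
    return Profile(base.executor_cores, doubled_cpus, heap, off_heap)


if __name__ == "__main__":
    base = Profile(executor_cores=4, cpus_per_task=1, heap_gib=8, off_heap_gib=2)
    for failures in range(6):
        print(failures, next_profile(base, failures))
```

In this sketch the first retry changes no hardware: doubling CPUs per task halves the number of concurrent tasks, so each task's rough share of the same heap doubles. Only if the OOM persists does the task move to a larger executor, which is the more expensive step.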

This summary was automatically generated by AI based on the original article and may not be fully accurate.
