
Troubleshooting Library

Real errors. Real fixes. No Stack Overflow copypasta.

This is a living document. Community members contribute their war stories so the next person doesn’t lose 3 hours to the same bug.


Symptom: docker-compose up runs but containers immediately stop. Cause: No foreground process to keep the container alive. Fix: Make sure your entrypoint/CMD runs in the foreground, not as a daemon. Example for Airflow: use airflow standalone not airflow webserver -D.
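The foreground rule can be illustrated in Python (a sketch, not an official Docker pattern; `entrypoint` is our own name). A blocking call like `subprocess.run` keeps the container's PID 1 alive for the child's whole lifetime, whereas launching a daemon and returning makes PID 1 exit, which stops the container:

```python
import subprocess

def entrypoint(cmd):
    """Run cmd in the foreground: block until it exits, then propagate its code.

    If this function instead used Popen() and returned immediately (the
    daemon pattern), the container's PID 1 would exit and Docker would
    stop the container.
    """
    return subprocess.run(cmd).returncode
```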

“Permission denied” on volume mounts (Mac/Linux)


Symptom: Container writes fail with permission errors on mounted directories. Fix: Set the correct owner in the Dockerfile:

RUN chown -R 1000:1000 /opt/airflow/logs

Or use user: "${UID}:${GID}" in docker-compose.yml.
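A quick diagnostic for the ownership mismatch, as a hedged sketch (`owner_matches` is our own helper; UID 1000 is an assumption matching the common Airflow container user):

```python
import os

def owner_matches(path, expected_uid=1000):
    """True if `path` is owned by `expected_uid` (the UID the container runs as)."""
    return os.stat(path).st_uid == expected_uid
```

Run it against the host directory you are mounting; if it returns False, the `chown` or `user:` fix above applies.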

Cause: Usually corrupt data directory from a previous failed initialization. Fix:

docker-compose down -v # removes volumes too
docker-compose up

Warning: -v deletes all data.


Causes & fixes:

  1. Syntax error — run python your_dag.py directly to see the error
  2. DAG is in wrong folder — check dags_folder in airflow.cfg
  3. Scheduler hasn’t picked it up yet — wait 30s or restart scheduler
  4. catchup=True with old start_date — set catchup=False or use a recent start_date
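Cause 1 can be caught in bulk without waiting on the scheduler. A minimal sketch (our own helper, not an Airflow API) that compiles every file in a DAGs folder and reports the ones that fail to parse; adjust the path to your configured `dags_folder`:

```python
import pathlib

def find_syntax_errors(dags_folder="dags"):
    """Compile each .py file; return {path: error} for files that won't parse."""
    errors = {}
    for path in pathlib.Path(dags_folder).glob("*.py"):
        try:
            # compile() parses without executing, so broken imports won't fire
            compile(path.read_text(), str(path), "exec")
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors
```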

Scheduler shows zombie tasks / tasks stuck in “running”


Cause: Workers died mid-task, leaving orphaned task instances. Fix:

airflow tasks clear <dag_id> -s <start_date> -e <end_date> --yes

Or use the UI: Browse → Task Instances → filter by state “running” → Mark Failed.

Cause: Package installed in host env but not in Airflow’s Python env. Fix: Install inside the container, or use a custom Docker image. For docker-compose:

x-airflow-common: &airflow-common
  build:
    context: .
    dockerfile: Dockerfile.airflow

Cause: Model references a column that doesn’t exist yet in the source or upstream model. Fixes:

  • Run dbt compile to see the full SQL before execution
  • Check that your ref() points to the correct model
  • Run dbt run --select <upstream_model>+ to build deps first
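The `--select <upstream_model>+` fix works because dbt builds models in dependency order. A toy illustration of that ordering with the stdlib `graphlib` (the model names and graph here are made up for the example):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: model -> the models it ref()s.
deps = {"orders": {"stg_orders"}, "revenue": {"orders"}}

# static_order() yields upstream models before anything that depends on them,
# which is why building "stg_orders" first makes its columns visible to "orders".
build_order = list(TopologicalSorter(deps).static_order())
```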

Cause: Running all tests sequentially on large tables. Fix:

  • Use --threads 4 (or more) to parallelize
  • Add WHERE filters to custom tests
  • Set `+store_failures: true` under `tests:` in dbt_project.yml to persist failing rows to a table, so you can inspect failures without re-running the query

Compilation Error: depends on a node named 'X' which was not found


Cause: Model name mismatch between ref('model_name') and the actual filename. Fix: dbt model names are case-sensitive; ref('Orders') and ref('orders') are different models. Match the ref() to the exact filename.


Job runs slower than equivalent Python script


Cause: Usually too many small partitions, or data is not actually distributed. Fix:

# Check partition count
df.rdd.getNumPartitions()
# Repartition for better parallelism
df = df.repartition(200) # rule of thumb: 2-3x num cores
# For joins, broadcast small tables
from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "key")

OutOfMemoryError: GC overhead limit exceeded


Cause: Executor memory too low, or data skew causing one partition to hold too much. Fix:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

Check for skew: df.groupBy("key").count().orderBy("count", ascending=False).show().
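The same skew check, sketched without Spark so the idea is visible (a rough illustration with our own `skewed_keys` helper): count rows per key and flag any key holding a disproportionate share of the data.

```python
from collections import Counter

def skewed_keys(keys, threshold=0.5):
    """Return keys owning more than `threshold` of all rows (worst first)."""
    counts = Counter(keys)
    total = len(keys)
    return [k for k, c in counts.most_common() if c / total > threshold]
```

A key that dominates like this forces one partition (and one executor) to hold most of the data, which is what triggers the GC overhead error.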


Queries slow on large tables (45+ seconds)


Fix checklist:

  1. EXPLAIN ANALYZE your query — find the sequential scans
  2. Add indexes on columns in WHERE, JOIN, and ORDER BY
  3. Use VACUUM ANALYZE on the table
  4. For repeated aggregations: add a materialized view
-- Example: slow dashboard query fix
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders(created_at DESC);
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders(customer_id);
-- Materialized view for expensive aggregation
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT DATE(created_at) AS day, SUM(amount) AS revenue
FROM orders GROUP BY 1;
-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX idx_mv_daily_revenue_day ON mv_daily_revenue(day);
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue;

Cause: Connection pool exhausted. Fix: Use PgBouncer for connection pooling. For docker-compose setups, add pgbouncer as a service. Max connections default is 100; each Airflow worker opens its own connections.
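Why pooling helps, as a hedged sketch (generic pattern, not the PgBouncer or psycopg2 API; `SimplePool` and `connect` are our own names): a fixed set of connections is reused instead of each worker opening its own, so the server's `max_connections` limit is never approached.

```python
import queue

class SimplePool:
    """Reuse a fixed set of connections instead of opening one per task."""

    def __init__(self, connect, size=5):
        # `connect` stands in for your real driver call, e.g. psycopg2.connect
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(connect())

    def acquire(self):
        return self._q.get()   # blocks when the pool is exhausted, instead of
                               # opening connection number 101 and failing

    def release(self, conn):
        self._q.put(conn)
```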


Causes:

  1. Consumer is slower than producer (most common)
  2. Too few partitions — consumers can’t parallelize
  3. Message processing is blocking

Fixes:

  1. Increase partition count: kafka-topics.sh --bootstrap-server <broker:9092> --alter --topic your-topic --partitions 12 (partitions can only be increased, and existing keys may map to new partitions, breaking per-key ordering)
  2. Use batching in consumer: max_poll_records=500
  3. Move slow processing to async workers
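Fix 2 amortizes per-message overhead across a batch, which is what `max_poll_records=500` gives a Kafka consumer. A framework-free sketch of the batching itself (`batched` is our own helper):

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```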

Messages processed twice / duplicate processing


Cause: Consumer crashes after processing but before committing offset. Fix: Make processing idempotent. Use enable.auto.commit=false and commit only after successful processing + write.
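A minimal illustration of the idempotency fix (our own names, not the Kafka API): track processed message IDs, skip redeliveries, and record an ID only after the work succeeds, mirroring "commit only after successful processing + write".

```python
def handle(messages, seen_ids, results):
    """Process (msg_id, payload) pairs idempotently; seen_ids is a set."""
    for msg_id, payload in messages:
        if msg_id in seen_ids:           # redelivery after a crash: skip it
            continue
        results.append(payload.upper())  # stand-in for the real processing
        seen_ids.add(msg_id)             # "commit" only after success
```

In production the seen-ID set must be durable (e.g. a unique key in the sink database), so a crash between processing and commit still cannot produce a duplicate write.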


State lock error: Error acquiring the state lock


Cause: Previous terraform apply was interrupted; DynamoDB lock not released. Fix:

terraform force-unlock <LOCK_ID>

Get lock ID from the error message. Only run this if you’re sure no other apply is running.

Error: Provider produced inconsistent result after apply


Cause: Provider bug or resource drift between plan and apply. Fix: Run terraform apply -refresh-only (the modern replacement for the deprecated terraform refresh), then terraform plan again. If it persists, check the provider version and pin it.


Checklist:

  1. Check the bucket policy — it may explicitly deny even with IAM allow
  2. Check if bucket has Block Public Access settings that override policy
  3. Verify the role is attached to the right resource (EC2 instance, Lambda, etc.)
  4. S3 uses account-level Block Public Access — check at account level too

Fix: For large files, use S3 Select to filter before reading, or trigger an ECS/Fargate task for heavy processing. Lambda max timeout is 15 minutes.


See something that should be here? Drop it in #troubleshooting on Discord or open a PR. Format: Symptom → Cause → Fix. Include the actual error message when possible.