Recently, I encountered an intermittent issue in our Jenkins pipeline that was both frustrating and challenging to debug. The error message looked like this:

wrapper script does not seem to be touching the log file in /usr/local/jenkins/workspace/integrated_csp-qa_e2e-tests_main@tmp/durable-ba915d12
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

This issue occurred in about 1 out of every 10 builds, and after some investigation, I was able to identify the root cause and implement a solution. Here’s a detailed breakdown of the problem and how I resolved it.


Context

  1. Pipeline Setup:
  • The Jenkins pipeline runs UI end-to-end (E2E) tests using Playwright inside a Docker container.
  • These tests launch a Chromium browser to perform various UI interactions.
  • Here is a sample Jenkinsfile:
buildUtils = new com.example.BuildUtils(this)

pipeline {
    agent {
        docker {
            label 'build'
            image buildUtils.DEFAULT_E2E_TESTS_IMAGE
        }
    }
    stages {
        stage('Test') {
            steps {
                sh 'make e2e-tests'
            }
        }
    }
    post {
        always {
            cleanWs()
        }
    }
}
  2. Symptoms:
  • The pipeline would hang intermittently, and Jenkins would terminate the process after 8 minutes of inactivity. See detailed logs:
14:59:58  2025-03-26 14:59:57 +0000 [    INFO] - message: Waiting for dashboard to load (wait_loaded_dashboard - tests.pages.dashboard:114)
14:59:58  2025-03-26 14:59:57 +0000 [    INFO] - message: Sleeping for 2 seconds, 10 seconds till wake up (sleep - tests.utils.tool_box:80)
14:59:59  2025-03-26 14:59:59 +0000 [    INFO] - message: Sleeping for 2 seconds, 8 seconds till wake up (sleep - tests.utils.tool_box:80)
15:08:06  wrapper script does not seem to be touching the log file in /usr/local/jenkins/workspace/e2e-tests_main@tmp/durable-ba915d12
15:08:06  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
  3. Initial Observations:
  • The issue was not due to resource constraints on the Jenkins agents.
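
For completeness: the workaround hinted at in the error message itself would only be appropriate if the filesystem really were slow, which was not the case here. It is a JVM system property set on the Jenkins controller; the exact location varies by installation, but as a sketch:

```shell
# Hypothetical: the workaround from the warning raises the durable-task
# heartbeat check interval to 24 hours (86400 seconds). It belongs in
# the Jenkins controller's JVM arguments, e.g. in /etc/default/jenkins
# or the systemd unit file (paths vary by installation).
JAVA_ARGS="$JAVA_ARGS -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400"
echo "$JAVA_ARGS"
```

This only silences the heartbeat check; it does not fix a genuinely hung build, which is why I kept digging.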

Debugging the Issue

To investigate further, I added a monitoring script, started just before the tests run, to track resource usage and process states. Here’s the script I used:

stage('Pre Test') {
    steps {
        sh '''
            nohup bash -c '
            while true; do
                printf "%s\n%s\n\n" "$(date)" "$(top -b -n 1)" >> resources_usage.log
                sleep 5
            done
            ' > /dev/null 2>&1 &
            echo $! > resources_monitor_pid.txt
        '''
    }
}

The script continuously logs system resource usage (e.g., CPU, memory) and running processes during the pipeline execution. It runs top every 5 seconds and appends the output to resources_usage.log. The script runs in the background using nohup to ensure it doesn’t terminate when the main process exits. The process ID (PID) of the monitoring script is saved to resources_monitor_pid.txt for cleanup later.
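
As a side note, a quick way to see why zombies can pile up inside a container is to check what PID 1 is. If PID 1 is the container's start command rather than an init process, nothing reaps orphaned children. A Linux-only sketch:

```shell
# Linux-only sketch: inspect PID 1 inside the container. If it is the
# container's start command rather than an init process, orphaned
# children are never reaped and show up as zombies (state Z) in top.
pid1=$(cat /proc/1/comm)
echo "PID 1 is: $pid1"
```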

At the end of the pipeline, I added cleanup and archiving steps to stop the monitoring process and preserve the log file as a build artifact:

post {
    always {
        script {
            sh '''
                if [ -f resources_monitor_pid.txt ]; then
                    kill $(cat resources_monitor_pid.txt) || true
                    rm -f resources_monitor_pid.txt
                fi
            '''
        }
        archiveArtifacts artifacts: 'resources_usage.log', allowEmptyArchive: true
        cleanWs()
    }
}

Findings

Here is sample output from the resources_usage.log file:

Thu Mar 26 20:47:27 UTC 2025
top - 20:47:27 up 93 days,  5:07,  0 users,  load average: 0.85, 1.08, 1.49
Tasks:  54 total,   1 running,  36 sleeping,   0 stopped,  17 zombie
%Cpu(s):  4.6 us,  1.2 sy,  0.0 ni, 94.0 id,  0.0 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem : 128558.3 total,  19855.3 free,   7346.9 used, 101356.1 buff/cache
MiB Swap:  34178.0 total,  33497.1 free,    680.9 used. 119858.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1656 jenkins   20   0   32.9g 143604 104896 S 106.7   0.1   0:06.50 chrome
   1753 jenkins   20   0 1392.1g 156448 116256 S  73.3   0.1   0:00.42 chrome
   1625 jenkins   20   0   32.8g 198852 152168 S  26.7   0.2   0:02.44 chrome
    221 jenkins   20   0 1324624  83320  40876 S   6.7   0.1   0:04.54 node
      1 jenkins   20   0   26216   1424   1232 S   0.0   0.0   0:00.04 cat
     13 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 sh
     27 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 sh
     33 jenkins   20   0   15068   3408   3028 S   0.0   0.0   0:00.06 bash
     45 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 sh
    133 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 sh
    163 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 sh
    188 jenkins   20   0   15072   1832   1472 S   0.0   0.0   0:00.00 sh
    190 jenkins   20   0   15072   2096   1728 S   0.0   0.0   0:00.05 sh
    191 jenkins   20   0   15068   3340   3036 S   0.0   0.0   0:00.00 sh
    194 jenkins   20   0   57660  16952   7808 S   0.0   0.0   0:00.60 poe
    200 jenkins   20   0  445208 128060  21816 S   0.0   0.1   0:05.80 pytest
    233 jenkins   20   0   32.7g 200148 155548 S   0.0   0.2   0:06.08 chrome
    235 jenkins   20   0   32.0g   3308   3024 S   0.0   0.0   0:00.00 chrome_+
    237 jenkins   20   0   32.0g   1588   1396 S   0.0   0.0   0:00.00 chrome_+
    242 jenkins   20   0   32.5g  63852  48824 S   0.0   0.0   0:00.03 chrome
    243 jenkins   20   0   32.5g  63988  48960 S   0.0   0.0   0:00.06 chrome
    263 jenkins   20   0   32.9g 113116  91408 S   0.0   0.1   0:18.90 chrome
    287 jenkins   20   0   32.5g 106892  91328 S   0.0   0.1   0:01.06 chrome
    288 jenkins   20   0   32.6g  50196  34292 S   0.0   0.0   0:00.04 chrome
    489 jenkins   20   0   32.7g 197820 153836 S   0.0   0.2   0:05.26 chrome
    491 jenkins   20   0   32.0g   3452   3168 S   0.0   0.0   0:00.00 chrome_+
    493 jenkins   20   0   32.0g   1560   1372 S   0.0   0.0   0:00.00 chrome_+
    498 jenkins   20   0   32.5g  63204  48316 S   0.0   0.0   0:00.03 chrome
    499 jenkins   20   0   32.5g  64076  49040 S   0.0   0.0   0:00.06 chrome
    520 jenkins   20   0   32.9g 113684  91876 S   0.0   0.1   0:15.46 chrome
    521 jenkins   20   0   32.5g 107452  92224 S   0.0   0.1   0:00.92 chrome
    522 jenkins   20   0   32.6g  51856  35924 S   0.0   0.0   0:00.03 chrome
    756 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 chrome_+
    758 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 chrome_+
    763 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.04 chrome
    764 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.06 chrome
   1037 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 chrome_+
   1039 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 chrome_+
   1044 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.04 chrome
   1045 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.06 chrome
   1327 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 chrome_+
   1329 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.00 chrome_+
   1334 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.04 chrome
   1335 jenkins   20   0       0      0      0 Z   0.0   0.0   0:00.06 chrome
   1627 jenkins   20   0   32.0g   3288   3008 S   0.0   0.0   0:00.00 chrome_+
   1629 jenkins   20   0   32.0g   1616   1428 S   0.0   0.0   0:00.00 chrome_+
   1634 jenkins   20   0   32.5g  63612  48704 S   0.0   0.0   0:00.03 chrome
   1635 jenkins   20   0   32.5g  64152  49124 S   0.0   0.0   0:00.04 chrome
   1657 jenkins   20   0   32.6g 111520  87772 S   0.0   0.1   0:00.50 chrome
   1660 jenkins   20   0   32.6g  46376  30316 S   0.0   0.0   0:00.02 chrome
   1752 jenkins   20   0   26080   1612   1416 S   0.0   0.0   0:00.00 sleep
   1762 jenkins   20   0 1392.1g  60080  40476 S   0.0   0.0   0:00.02 chrome
   1776 jenkins   20   0   15068    380      0 S   0.0   0.0   0:00.00 bash
   1777 jenkins   20   0   52176   4296   3712 R   0.0   0.0   0:00.00 top

After analyzing the file, I discovered the following:

  • There were numerous zombie (Z) processes for sh and chrome. This indicated that child processes were not being properly reaped by the parent process.
  • Chromium was consuming an unusually high amount of CPU, which likely contributed to the pipeline hanging.
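
Since top already summarizes zombies on its "Tasks:" line, the archived log can be scanned for zombie growth over time. A small post-hoc sketch (a stand-in log line is created here so the snippet is self-contained):

```shell
# Hypothetical post-hoc check: pull the zombie count out of each
# "Tasks:" summary line that top wrote into resources_usage.log.
# (A stand-in log line is created here so the sketch is self-contained.)
printf 'Tasks:  54 total,   1 running,  36 sleeping,   0 stopped,  17 zombie\n' > resources_usage.log
awk '/^Tasks:/ { for (i = 1; i < NF; i++) if ($(i+1) == "zombie") print $i }' resources_usage.log
# prints: 17
```

A steadily climbing count across entries, as I saw here, points at a reaping problem rather than a one-off crash.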

Upon reviewing the Playwright documentation, I found its recommendations for running Chromium inside Docker containers.


Solution

Based on these findings, I updated the Docker configuration in the Jenkins pipeline to include the recommended flags:

  1. --ipc=host: This flag makes the container share the host’s IPC namespace, so Chromium is not constrained by the container’s small default /dev/shm and avoids crashes due to shared-memory exhaustion.

  2. --init: This flag runs a minimal init process as PID 1 inside the container, so orphaned child processes are properly reaped, preventing zombie processes.

Here’s the updated agent block:

agent {
    docker {
        label 'build'
        image buildUtils.DEFAULT_E2E_TESTS_IMAGE
        args '--ipc=host --init'
    }
}
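
For reference, outside Jenkins the same flags would look like this (the image name below is a placeholder standing in for buildUtils.DEFAULT_E2E_TESTS_IMAGE):

```shell
# Hypothetical docker run equivalent of the agent block above; the
# image name is a placeholder for the pipeline's E2E test image.
# --ipc=host: share the host IPC namespace (Chromium shared memory)
# --init:     run a small init as PID 1 to reap child processes
docker run --rm --ipc=host --init my-e2e-tests-image make e2e-tests
```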

Results

After applying these changes, the issue was completely resolved:

  • No more zombie processes were observed.
  • Chromium’s CPU usage stabilized.
  • All builds became stable, with no intermittent failures.

Key Takeaways

  1. Monitor Resource Usage: Adding a resource monitoring script can provide valuable insights into what’s happening during pipeline execution.

  2. Follow Best Practices for Dockerized Applications: Always review the documentation for tools like Playwright to ensure you’re using the recommended configurations for Docker.

  3. Use --ipc=host and --init: These flags are essential when running Chromium inside Docker to prevent memory issues and zombie processes.

P.S.: Powered by GitHub Copilot