Recently, I encountered an intermittent issue in our Jenkins pipeline that was both frustrating and challenging to debug. The error message looked like this:
wrapper script does not seem to be touching the log file in /usr/local/jenkins/workspace/integrated_csp-qa_e2e-tests_main@tmp/durable-ba915d12
(JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
This issue occurred in about 1 out of every 10 builds, and after some investigation, I was able to identify the root cause and implement a solution. Here’s a detailed breakdown of the problem and how I resolved it.
Context
- Pipeline Setup:
  - The Jenkins pipeline runs UI end-to-end (E2E) tests using Playwright inside a Docker container.
  - These tests launch a Chromium browser to perform various UI interactions.
  - Here is a sample Jenkinsfile:
buildUtils = new com.example.BuildUtils(this)

pipeline {
    agent {
        docker {
            label 'build'
            image buildUtils.DEFAULT_E2E_TESTS_IMAGE
        }
    }
    stages {
        stage('Test') {
            steps {
                sh 'make e2e-tests'
            }
        }
    }
    post {
        always {
            cleanWs()
        }
    }
}
- Symptoms:
  - The pipeline would hang intermittently, and Jenkins would terminate the process after about 8 minutes of inactivity. See the detailed logs:
14:59:58 2025-03-26 14:59:57 +0000 [ INFO] - message: Waiting for dashboard to load (wait_loaded_dashboard - tests.pages.dashboard:114)
14:59:58 2025-03-26 14:59:57 +0000 [ INFO] - message: Sleeping for 2 seconds, 10 seconds till wake up (sleep - tests.utils.tool_box:80)
14:59:59 2025-03-26 14:59:59 +0000 [ INFO] - message: Sleeping for 2 seconds, 8 seconds till wake up (sleep - tests.utils.tool_box:80)
15:08:06 wrapper script does not seem to be touching the log file in /usr/local/jenkins/workspace/e2e-tests_main@tmp/durable-ba915d12
15:08:06 (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
- Initial Observations:
  - The issue was not due to resource constraints on the Jenkins agents.
Debugging the Issue
To investigate further, I added a monitoring script to track resource usage and process states before the tests execute. Here’s the script I used:
stage('Pre Test') {
    steps {
        sh '''
            # Append a timestamped "top" snapshot to the log every 5 seconds.
            nohup bash -c '
                while true; do
                    echo "$(date)\n$(top -b -n 1)\n" >> resources_usage.log
                    sleep 5
                done
            ' > /dev/null 2>&1 &
            # Save the monitor PID so it can be killed in the post step.
            echo $! > resources_monitor_pid.txt
        '''
    }
}
The script continuously logs system resource usage (e.g., CPU, memory) and running processes during the pipeline execution. It runs top every 5 seconds and appends the output to resources_usage.log.
The script runs in the background using nohup to ensure it doesn’t terminate when the main process exits. The process ID (PID) of the monitoring script is saved to resources_monitor_pid.txt for cleanup later.
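Once the job finishes, a quick way to see how the zombie count evolved is to pull the `Tasks:` header that top prints at the start of every snapshot. The sketch below fakes two snapshot headers with a heredoc purely for illustration; in the real pipeline the file comes from the monitoring step above.

```shell
# Fake two "top" snapshot headers so the command below has input to work on;
# in the real pipeline, resources_usage.log is produced by the monitoring step.
cat > resources_usage.log <<'EOF'
Tasks:  54 total,   1 running,  36 sleeping,   0 stopped,  17 zombie
Tasks:  55 total,   1 running,  37 sleeping,   0 stopped,  19 zombie
EOF

# Extract the zombie count from every snapshot to see how it trends over time.
grep -o '[0-9]\+ zombie' resources_usage.log
```

A steadily growing count across snapshots is a strong hint that nothing in the container is reaping dead children.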
At the end of the pipeline, I added cleanup and archiving steps so the log file is preserved as a build artifact:
post {
    always {
        script {
            sh '''
                if [ -f resources_monitor_pid.txt ]; then
                    kill $(cat resources_monitor_pid.txt) || true
                    rm -f resources_monitor_pid.txt
                fi
            '''
        }
        archiveArtifacts artifacts: 'resources_usage.log', allowEmptyArchive: true
        cleanWs()
    }
}
Findings
Here is a sample of the resources_usage.log file:
Thu Mar 26 20:47:27 UTC 2025
top - 20:47:27 up 93 days, 5:07, 0 users, load average: 0.85, 1.08, 1.49
Tasks: 54 total, 1 running, 36 sleeping, 0 stopped, 17 zombie
%Cpu(s): 4.6 us, 1.2 sy, 0.0 ni, 94.0 id, 0.0 wa, 0.1 hi, 0.0 si, 0.0 st
MiB Mem : 128558.3 total, 19855.3 free, 7346.9 used, 101356.1 buff/cache
MiB Swap: 34178.0 total, 33497.1 free, 680.9 used. 119858.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1656 jenkins 20 0 32.9g 143604 104896 S 106.7 0.1 0:06.50 chrome
1753 jenkins 20 0 1392.1g 156448 116256 S 73.3 0.1 0:00.42 chrome
1625 jenkins 20 0 32.8g 198852 152168 S 26.7 0.2 0:02.44 chrome
221 jenkins 20 0 1324624 83320 40876 S 6.7 0.1 0:04.54 node
1 jenkins 20 0 26216 1424 1232 S 0.0 0.0 0:00.04 cat
13 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 sh
27 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 sh
33 jenkins 20 0 15068 3408 3028 S 0.0 0.0 0:00.06 bash
45 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 sh
133 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 sh
163 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 sh
188 jenkins 20 0 15072 1832 1472 S 0.0 0.0 0:00.00 sh
190 jenkins 20 0 15072 2096 1728 S 0.0 0.0 0:00.05 sh
191 jenkins 20 0 15068 3340 3036 S 0.0 0.0 0:00.00 sh
194 jenkins 20 0 57660 16952 7808 S 0.0 0.0 0:00.60 poe
200 jenkins 20 0 445208 128060 21816 S 0.0 0.1 0:05.80 pytest
233 jenkins 20 0 32.7g 200148 155548 S 0.0 0.2 0:06.08 chrome
235 jenkins 20 0 32.0g 3308 3024 S 0.0 0.0 0:00.00 chrome_+
237 jenkins 20 0 32.0g 1588 1396 S 0.0 0.0 0:00.00 chrome_+
242 jenkins 20 0 32.5g 63852 48824 S 0.0 0.0 0:00.03 chrome
243 jenkins 20 0 32.5g 63988 48960 S 0.0 0.0 0:00.06 chrome
263 jenkins 20 0 32.9g 113116 91408 S 0.0 0.1 0:18.90 chrome
287 jenkins 20 0 32.5g 106892 91328 S 0.0 0.1 0:01.06 chrome
288 jenkins 20 0 32.6g 50196 34292 S 0.0 0.0 0:00.04 chrome
489 jenkins 20 0 32.7g 197820 153836 S 0.0 0.2 0:05.26 chrome
491 jenkins 20 0 32.0g 3452 3168 S 0.0 0.0 0:00.00 chrome_+
493 jenkins 20 0 32.0g 1560 1372 S 0.0 0.0 0:00.00 chrome_+
498 jenkins 20 0 32.5g 63204 48316 S 0.0 0.0 0:00.03 chrome
499 jenkins 20 0 32.5g 64076 49040 S 0.0 0.0 0:00.06 chrome
520 jenkins 20 0 32.9g 113684 91876 S 0.0 0.1 0:15.46 chrome
521 jenkins 20 0 32.5g 107452 92224 S 0.0 0.1 0:00.92 chrome
522 jenkins 20 0 32.6g 51856 35924 S 0.0 0.0 0:00.03 chrome
756 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 chrome_+
758 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 chrome_+
763 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.04 chrome
764 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.06 chrome
1037 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 chrome_+
1039 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 chrome_+
1044 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.04 chrome
1045 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.06 chrome
1327 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 chrome_+
1329 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.00 chrome_+
1334 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.04 chrome
1335 jenkins 20 0 0 0 0 Z 0.0 0.0 0:00.06 chrome
1627 jenkins 20 0 32.0g 3288 3008 S 0.0 0.0 0:00.00 chrome_+
1629 jenkins 20 0 32.0g 1616 1428 S 0.0 0.0 0:00.00 chrome_+
1634 jenkins 20 0 32.5g 63612 48704 S 0.0 0.0 0:00.03 chrome
1635 jenkins 20 0 32.5g 64152 49124 S 0.0 0.0 0:00.04 chrome
1657 jenkins 20 0 32.6g 111520 87772 S 0.0 0.1 0:00.50 chrome
1660 jenkins 20 0 32.6g 46376 30316 S 0.0 0.0 0:00.02 chrome
1752 jenkins 20 0 26080 1612 1416 S 0.0 0.0 0:00.00 sleep
1762 jenkins 20 0 1392.1g 60080 40476 S 0.0 0.0 0:00.02 chrome
1776 jenkins 20 0 15068 380 0 S 0.0 0.0 0:00.00 bash
1777 jenkins 20 0 52176 4296 3712 R 0.0 0.0 0:00.00 top
After analyzing the file, I discovered the following:
- There were numerous zombie (Z) processes for sh and chrome. This indicated that child processes were not being properly reaped by the parent process.
- Chromium was consuming an unusually high amount of CPU, which likely contributed to the pipeline hanging.
Upon reviewing the Playwright documentation, I found its recommendations for running Chromium inside Docker containers.
Solution
Based on these findings, I updated the Docker configuration in the Jenkins pipeline to include the recommended flags:
- --ipc=host: This flag lets Chromium share the host’s IPC namespace, preventing crashes caused by exhausting the container’s limited shared memory.
- --init: This flag runs a lightweight init process as PID 1 inside the container to handle process reaping, preventing zombie processes.
Here’s the updated agent block:
agent {
    docker {
        label 'build'
        image buildUtils.DEFAULT_E2E_TESTS_IMAGE
        args '--ipc=host --init'
    }
}
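For debugging outside Jenkins, the same flags can be passed to docker run directly. This is a sketch: the image name is a placeholder for whatever DEFAULT_E2E_TESTS_IMAGE resolves to in your setup.

```shell
# Hypothetical local equivalent of the agent block above; "e2e-tests-image"
# is a placeholder for the real E2E test image.
docker run --rm --ipc=host --init e2e-tests-image make e2e-tests
```

With --init active, Docker injects its bundled init binary as PID 1, so orphaned children are reaped instead of piling up as zombies.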
Results
After applying these changes, the issue was completely resolved:
- No more zombie processes were observed.
- Chromium’s CPU usage stabilized.
- All builds became stable, with no intermittent failures.
Key Takeaways
- Monitor Resource Usage: Adding a resource monitoring script can provide valuable insights into what’s happening during pipeline execution.
- Follow Best Practices for Dockerized Applications: Always review the documentation for tools like Playwright to ensure you’re using the recommended configurations for Docker.
- Use --ipc=host and --init: These flags are essential when running Chromium inside Docker to prevent shared-memory issues and zombie processes.
P.S. Powered by GitHub Copilot