Build Jobs: Parts of Test Suite Fail Regularly #2439

Closed · sanssecours opened this issue Feb 25, 2019 · 14 comments

@sanssecours (Member) commented Feb 25, 2019

Description

I opened this issue to keep track of all of the temporary test failures in one of the build jobs. The main reasons for the build failures are

In a recent PR I had to restart the Jenkins build job 5 times before everything worked. In the PR after that I restarted the Jenkins build job three times, as far as I can remember. Anyway, the failure rate is much too high in my opinion.

Failures

| Location | Failed Tests | Build Job |
|----------|--------------|-----------|
| master | testmod_gpgme (1) | debian-stable-full |
| master | testmod_gpgme (1), testmod_zeromqsend (1) | debian-stable-full-ini |
| master | testmod_crypto_botan (1), testmod_fcrypt (1), testmod_gpgme (2), testmod_zeromqsend (1) | debian-stable-full-mmap |
| master | testmod_crypto_botan (1), testmod_fcrypt (2) | debian-unstable-full |
| master | testmod_crypto_botan (2), testmod_crypto_openssl (3), testmod_fcrypt (1) | debian-unstable-full-clang |
| PR #2442 | testmod_crypto_openssl (1), testmod_gpgme (1) | debian-stable-full-ini |
| PR #2442 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1), testmod_gpgme (3) | debian-stable-full-mmap |
| PR #2442 | testmod_crypto_openssl (1), testmod_fcrypt (1) | debian-unstable-full |
| PR #2442 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1) | debian-unstable-full-clang |
| PR #2442 | testmod_dbus (1), testmod_dbusrecv (1) | 🍎 MMap |
| PR #2443 | testmod_crypto_botan (1), testmod_fcrypt (1) | debian-unstable-full |
| PR #2443 | testmod_crypto_openssl (1), testmod_crypto_botan (1) | debian-unstable-full-clang |
| PR #2443 | testmod_dbus (1), testmod_dbusrecv (1) | 🍎 MMap |
| PR #2445 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1) | debian-stable-full-ini |
| PR #2445 | testmod_crypto_openssl (2), testmod_crypto_botan (2), testmod_fcrypt (2), testmod_gpgme (1) | debian-stable-full-mmap |
| PR #2445 | testmod_crypto_openssl (2), testmod_fcrypt (2) | debian-unstable-full |
| PR #2445 | testmod_dbus (1), testmod_dbusrecv (1) | 🍏 GCC |
@markus2330 (Contributor)

Thank you for your summary of these problems!

Is it maybe possible to disable the jobs only at the places where they are failing?

@petermax2 (Member)

Regarding the crypto and fcrypt plugins: @mpranj pointed out that gpg-agent may fail under high server load. Maybe we could create a separate build job for the crypto and fcrypt plugin tests, so that other development is not blocked?

@markus2330 (Contributor)

Thank you for your input!

Separating the problematic jobs might make the rebuild cycles shorter. But I think it is clear that we do not want any manual rebuilds at all. So we have the options:

  • making it more reliable
  • some automatic loops which retry on such errors
  • disabling the tests (when someone works on these parts, she needs to activate them again)

What do you think?

@petermax2 (Member)

> • making it more reliable

Hardly possible as long as we utilize gpg-agent (which is a pain in batch jobs).

> • some automatic loops which retry on such errors

This feels dirty to me.

> • disabling the tests (when someone works on these parts, she needs to activate them again)

This seems to be the option that causes the least discomfort, although having manual regression tests is not nice either.

@markus2330 (Contributor)

As discussed in the meeting: we should disable the tests.

@kodebach (Member)

Alternative also discussed in the meeting: Using ctest --rerun-failed

Running ctest creates the file <cmake_build_dir>/Testing/Temporary/LastTestsFailed[_timestamp].log (the timestamp is only used in Dashboard mode). This file is also used by ctest --rerun-failed (see Kitware/CMake@eb2decc). It simply contains the numbers and names of the tests that last failed.

My proposal would be to call ctest as before. If it exits unsuccessfully, use grep on LastTestsFailed.log to check whether one of the tests listed above failed, and only then use ctest --rerun-failed. This causes less duplicate/confusing output.
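
A minimal sketch of that flow, assuming the directory layout described above and an illustrative (not authoritative) list of flaky test names, could look like this:

```python
#!/usr/bin/env python3
# Sketch only: run ctest once; if it fails and one of the known flaky tests is
# among the failures, rerun just the failed tests via `ctest --rerun-failed`.
# The FLAKY_TESTS set below is illustrative, not an authoritative list.
import subprocess
import sys
from pathlib import Path

FLAKY_TESTS = {"testmod_gpgme", "testmod_fcrypt", "testmod_crypto_botan",
               "testmod_crypto_openssl", "testmod_zeromqsend"}

build_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")

# Normal test run, with the same arguments as before.
first = subprocess.run(["ctest", "--output-on-failure"], cwd=build_dir)
if first.returncode == 0:
    sys.exit(0)

# LastTestsFailed.log contains one "<number>:<name>" line per failed test.
log = build_dir / "Testing" / "Temporary" / "LastTestsFailed.log"
if not log.exists():
    sys.exit(first.returncode)
failed = {line.split(":", 1)[1].strip()
          for line in log.read_text().splitlines() if ":" in line}

if failed & FLAKY_TESTS:
    # At least one known flaky test failed: give the failed tests a second try.
    sys.exit(subprocess.run(["ctest", "--rerun-failed", "--output-on-failure"],
                            cwd=build_dir).returncode)

sys.exit(first.returncode)
```

A failure in any other test would still fail the build immediately, so the extra output only shows up for the tests that are already known to be flaky.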

But if the problem really is high server load, that won't help much. Instead we could try ctest --test-load. This should cause ctest to keep the CPU load below a certain threshold.

IMO still the best option would be to disable the tests and create a small build job that only installs the dependencies needed by these plugins/libraries, only compiles what is necessary and only runs the problematic tests. That way we could probably get the runtime down to a few minutes, in which case manual restarting would be acceptable, I think. For comparison, our FreeBSD jobs currently take about 10 min (7 min build, 2 min test, 1 min other) to run ~200 tests.

PS: Not sure about our setup, but restarting a Jenkins pipeline from a certain stage should be possible.

@markus2330 (Contributor)

> Alternative also discussed in the meeting: Using ctest --rerun-failed

Thank you for looking into it!

> But if the problem really is high server load, that won't help much. Instead we could try ctest --test-load.

@ingwinlu did a lot of work in this direction. Our servers have the highest throughput under high load, i.e. we would slow down our tests with such options.

> IMO still the best option would be to disable the tests and create a small build job that only installs the

Modular test cases are very difficult to achieve and maintain. @ingwinlu put a lot of work into it. I think we cannot put in this effort again only for a few unreliable tests.

> PS: Not sure about our setup, but restarting a Jenkins pipeline from a certain stage should be possible.

That would be great. But I do not see the restart button in our GUI. Do we need another plugin or a newer version? @ingwinlu tried to add "jenkins build * please" for all pipeline steps; unfortunately, it did not work.

sanssecours added a commit to sanssecours/elektra that referenced this issue Mar 1, 2019
This update should get rid of most of the temporary test failures
reported in issue [ElektraInitiative#2439](https://issues.libelektra.org/2439).

This commit closes ElektraInitiative#2439.
@markus2330 (Contributor)

It seems like we still have failures (dbus see #2532)

@markus2330 (Contributor)

What about excluding the dbus test cases for the Mac builds?

@dominicjaeger (Contributor)

> It seems like we still have failures (dbus see #2532)

Yes, we do.

gcc --version

Configured with: --prefix=/Applications/Xcode-10.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode-10.2.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.5.0
Thread model: posix
InstalledDir: /Applications/Xcode-10.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

(...)

DBUSRECV TESTS
==============
testing prerequisites
detecting available bus types - please ignore single error messages prefixed with "connect:"
connect: Failed to open connection to system message bus: Failed to connect to socket /usr/local/var/run/dbus/system_bus_socket: No such file or directory
test commit
test adding keys
../src/plugins/dbusrecv/testmod_dbusrecv.c:228: error in test_keyAdded: string "system/tests/testmod_dbusrecv/added" is not equal to "user/tests/foo/bar"
	compared: expectedKeyName and keyName (test_callbackKey)
test adding keys

testmod_dbusrecv Results: 34 Tests done — 1 error.

@markus2330 (Contributor)

Were you able to reproduce it locally?

We still do not know why this problem sporadically occurs. If you have any input, it would be great.

Maybe we can simply exclude the tests from the problematic build jobs? Or do the dbus* test cases fail on every build job where they run?

@dominicjaeger (Contributor)

> Were you able to reproduce it locally?

Unfortunately not. I'm on Ubuntu.

> Maybe we can simply exclude the tests from the problematic build jobs? Or do the dbus* test cases fail on every build job where they run?

I just restarted the build job to see if it happens again.

@petermax2 (Member)

Please re-assign me if necessary.

@markus2330 (Contributor)

I now implemented automatic retry of ctest in #3224. If you still experience temporary failures of the test suites, please reopen the issue. (We can increase the number of tries.)
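
For illustration only, a bounded retry along these lines could look like the sketch below (the actual implementation in #3224 may well differ):

```python
#!/usr/bin/env python3
# Illustrative sketch, not the actual #3224 change: retry ctest a bounded
# number of times, rerunning only the previously failed tests on each retry.
import subprocess
import sys

MAX_TRIES = 3  # the number of tries could be increased if failures persist

result = subprocess.run(["ctest", "--output-on-failure"])
tries = 1
while result.returncode != 0 and tries < MAX_TRIES:
    # `ctest --rerun-failed` re-executes only the tests recorded in
    # Testing/Temporary/LastTestsFailed.log from the previous run.
    result = subprocess.run(["ctest", "--rerun-failed", "--output-on-failure"])
    tries += 1
sys.exit(result.returncode)
```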

For other failures of Jenkins/Docker we need to find other solutions, but first we finally need to do the migration. So please continue to restart the job in these cases.
