Build Jobs: Parts of Test Suite Fail Regularly #2439

Closed · sanssecours opened this issue Feb 25, 2019 · 14 comments

@sanssecours (Member) commented Feb 25, 2019

Description

I opened this issue to keep track of all of the temporary test failures in one of the build jobs. The main reasons for the build failures are

In a recent PR I had to restart the Jenkins build job 5 times before everything worked. In the PR after that I restarted the Jenkins build job three times, as far as I can remember. Anyway, the failure rate is much too high in my opinion.

Failures

| Location | Failed Tests | Build Job |
|----------|--------------|-----------|
| master | testmod_gpgme (1) | debian-stable-full |
| master | testmod_gpgme (1), testmod_zeromqsend (1) | debian-stable-full-ini |
| master | testmod_crypto_botan (1), testmod_fcrypt (1), testmod_gpgme (2), testmod_zeromqsend (1) | debian-stable-full-mmap |
| master | testmod_crypto_botan (1), testmod_fcrypt (2) | debian-unstable-full |
| master | testmod_crypto_botan (2), testmod_crypto_openssl (3), testmod_fcrypt (1) | debian-unstable-full-clang |
| PR #2442 | testmod_crypto_openssl (1), testmod_gpgme (1) | debian-stable-full-ini |
| PR #2442 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1), testmod_gpgme (3) | debian-stable-full-mmap |
| PR #2442 | testmod_crypto_openssl (1), testmod_fcrypt (1) | debian-unstable-full |
| PR #2442 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1) | debian-unstable-full-clang |
| PR #2442 | testmod_dbus (1), testmod_dbusrecv (1) | 🍎 MMap |
| PR #2443 | testmod_crypto_botan (1), testmod_fcrypt (1) | debian-unstable-full |
| PR #2443 | testmod_crypto_openssl (1), testmod_crypto_botan (1) | debian-unstable-full-clang |
| PR #2443 | testmod_dbus (1), testmod_dbusrecv (1) | 🍎 MMap |
| PR #2445 | testmod_crypto_openssl (1), testmod_crypto_botan (1), testmod_fcrypt (1) | debian-stable-full-ini |
| PR #2445 | testmod_crypto_openssl (2), testmod_crypto_botan (2), testmod_fcrypt (2), testmod_gpgme (1) | debian-stable-full-mmap |
| PR #2445 | testmod_crypto_openssl (2), testmod_fcrypt (2) | debian-unstable-full |
| PR #2445 | testmod_dbus (1), testmod_dbusrecv (1) | 🍏 GCC |
@markus2330 (Contributor)

Thank you for your summary of these problems!

Is it maybe possible to disable the jobs only at the places where they are failing?

@petermax2 (Member)

Regarding the crypto and fcrypt plugins: @mpranj pointed out that gpg-agent may fail under high server load. Maybe we could create a separate build job for the crypto and fcrypt plugin tests, so that other development is not blocked?

@markus2330 (Contributor)

Thank you for your input!

Separating the problematic jobs might make the rebuild cycles shorter. But I think it is clear that we do not want any manual rebuilds at all. So we have the options:

  • making it more reliable
  • some automatic loops which retry on such errors
  • disabling the tests (when someone works on these parts, she needs to activate them again)

What do you think?

@petermax2 (Member)

> • making it more reliable

Hardly possible as long as we utilize gpg-agent (which is a pain in batch jobs).

> • some automatic loops which retry on such errors

This feels dirty to me.

> • disabling the tests (when someone works on these parts, she needs to activate them again)

This seems to be the option that causes the least discomfort, although having manual regression tests is not nice either.

@markus2330 (Contributor)

As discussed in the meeting: we should disable the tests.

@kodebach (Member)

Alternative also discussed in the meeting: Using ctest --rerun-failed

Running ctest creates the file <cmake_build_dir>/Testing/Temporary/LastTestsFailed[_timestamp].log (the timestamp is only used in Dashboard mode). This file is also used by ctest --rerun-failed (see Kitware/CMake@eb2decc). It simply contains the numbers and names of the tests that last failed.

My proposal would be to call ctest as before. If it exits unsuccessfully, use grep on LastTestsFailed.log to check whether one of the tests listed above failed, and only then use ctest --rerun-failed. This causes less duplicate/confusing output.
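
A minimal sketch of that flow, assuming the directory layout described above and an illustrative (not authoritative) list of flaky test names, could look like this:

```python
#!/usr/bin/env python3
# Sketch only: run ctest once; if it fails and one of the known flaky tests is
# among the failures, rerun just the failed tests via `ctest --rerun-failed`.
# The FLAKY_TESTS set below is illustrative, not an authoritative list.
import subprocess
import sys
from pathlib import Path

FLAKY_TESTS = {"testmod_gpgme", "testmod_fcrypt", "testmod_crypto_botan",
               "testmod_crypto_openssl", "testmod_zeromqsend"}

build_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")

# Normal test run, with the same arguments as before.
first = subprocess.run(["ctest", "--output-on-failure"], cwd=build_dir)
if first.returncode == 0:
    sys.exit(0)

# LastTestsFailed.log contains one "<number>:<name>" line per failed test.
log = build_dir / "Testing" / "Temporary" / "LastTestsFailed.log"
if not log.exists():
    sys.exit(first.returncode)
failed = {line.split(":", 1)[1].strip()
          for line in log.read_text().splitlines() if ":" in line}

if failed & FLAKY_TESTS:
    # At least one known flaky test failed: give the failed tests a second try.
    sys.exit(subprocess.run(["ctest", "--rerun-failed", "--output-on-failure"],
                            cwd=build_dir).returncode)

sys.exit(first.returncode)
```

A failure in any other test would still fail the build immediately, so the extra output only shows up for the tests that are already known to be flaky.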

But if the problem really is high server load, that won't help much. Instead we could try ctest --test-load. This should cause ctest to keep the CPU load below a certain threshold.

IMO still the best option would be to disable the tests and create a small build job that only installs the dependencies needed by these plugins/libraries, only compiles what is necessary and only runs the problematic tests. That way we could probably get the runtime down to a few minutes, in which case manual restarting would be acceptable, I think. For comparison, our FreeBSD jobs currently take about 10 min (7 min build, 2 min test, 1 min other) to run ~200 tests.

PS: Not sure about our setup, but restarting a Jenkins pipeline from a certain stage should be possible.

@markus2330 (Contributor)

> Alternative also discussed in the meeting: Using ctest --rerun-failed

Thank you for looking into it!

> But if the problem really is high server load, that won't help much. Instead we could try ctest --test-load.

@ingwinlu did a lot of work in this direction. Our servers have the highest throughput under high load, i.e. we would slow down our tests with such options.

> IMO still the best option would be to disable the tests and create a small build job that only installs the

Modular test cases are very difficult to achieve and maintain. @ingwinlu put a lot of work into it. I think we cannot put in this effort again only for a few unreliable tests.

> PS: Not sure about our setup, but restarting a Jenkins pipeline from a certain stage should be possible.

That would be great. But I do not see the restart button in our GUI. Do we need another plugin or a newer version? @ingwinlu tried to add "jenkins build * please" for all pipeline steps; unfortunately, it did not work.

sanssecours added a commit to sanssecours/elektra that referenced this issue Mar 1, 2019
This update should get rid of most of the temporary test failures
reported in issue [ElektraInitiative#2439](https://issues.libelektra.org/2439).

This commit closes ElektraInitiative#2439.
@markus2330 (Contributor)

It seems like we still have failures (dbus see #2532)

@markus2330 (Contributor)

What about excluding the dbus test cases for the Mac builds?

@dominicjaeger (Contributor)

> It seems like we still have failures (dbus see #2532)

Yes, we do.

gcc --version

Configured with: --prefix=/Applications/Xcode-10.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode-10.2.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.5.0
Thread model: posix
InstalledDir: /Applications/Xcode-10.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

(...)

DBUSRECV TESTS
==============
testing prerequisites
detecting available bus types - please ignore single error messages prefixed with "connect:"
connect: Failed to open connection to system message bus: Failed to connect to socket /usr/local/var/run/dbus/system_bus_socket: No such file or directory
test commit
test adding keys
../src/plugins/dbusrecv/testmod_dbusrecv.c:228: error in test_keyAdded: string "system/tests/testmod_dbusrecv/added" is not equal to "user/tests/foo/bar"
	compared: expectedKeyName and keyName (test_callbackKey)
test adding keys

testmod_dbusrecv Results: 34 Tests done — 1 error.

@markus2330 (Contributor)

Were you able to reproduce it locally?

We still do not know why this problem sporadically occurs. If you have any input, it would be great.

Maybe we can simply exclude the tests from the problematic build jobs? Or do the dbus* test cases fail on every build job where they run?

@dominicjaeger (Contributor)

> Were you able to reproduce it locally?

Unfortunately not. I'm on Ubuntu.

> Maybe we can simply exclude the tests from the problematic build jobs? Or do the dbus* test cases fail on every build job where they run?

I just restarted the build job to see if it happens again.

@petermax2 (Member)

Please re-assign me if necessary.

@markus2330 (Contributor)

I now implemented automatic retry of ctest in #3224. If you still experience temporary failures of the test suites, please reopen the issue. (We can increase the number of tries.)
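
For illustration only, a bounded retry along these lines could look like the sketch below (the actual implementation in #3224 may well differ):

```python
#!/usr/bin/env python3
# Illustrative sketch, not the actual #3224 change: retry ctest a bounded
# number of times, rerunning only the previously failed tests on each retry.
import subprocess
import sys

MAX_TRIES = 3  # the number of tries could be increased if failures persist

result = subprocess.run(["ctest", "--output-on-failure"])
tries = 1
while result.returncode != 0 and tries < MAX_TRIES:
    # `ctest --rerun-failed` re-executes only the tests recorded in
    # Testing/Temporary/LastTestsFailed.log from the previous run.
    result = subprocess.run(["ctest", "--rerun-failed", "--output-on-failure"])
    tries += 1
sys.exit(result.returncode)
```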

For other failures of Jenkins/Docker we need to find other solutions, but first we finally need to do the migration. So please continue to restart the job in these cases.
