Skip to content

Get rid of force pod deletion#26

Closed
bart0sh wants to merge 56 commits into
pohly:dynamic-resource-allocationfrom
bart0sh:PR015-dra-e2e-get-rid-of-force-deletion
Closed

Get rid of force pod deletion#26
bart0sh wants to merge 56 commits into
pohly:dynamic-resource-allocationfrom
bart0sh:PR015-dra-e2e-get-rid-of-force-deletion

Conversation

@bart0sh

@bart0sh bart0sh commented Sep 16, 2022

Copy link
Copy Markdown

DRA e2e test used force pod deletion for test and plugin pods.
This should not be a default way of deleting pods, but rather an exception.

pohly and others added 30 commits September 1, 2022 08:39
It is useful to check new code with a stricter configuration because we want it
to be of higher quality. Code reviews also become easier when reviewers don't
need to point out inefficient code manually.

What exactly should be enabled is up for debate. The current config uses the
golangci-lint defaults plus everything that is enabled explicitly by the normal
.golangci.yaml, just to be on the safe side. Error checking in _test.go files
gets disabled because that seems to be fairly common.
When "make verify" is invoked in pull job, it will now also run golangci-lint
with the strict configuration and write an
$ARTIFACTS/golangci-lint-githubactions.log file with GitHub actions
annotations. How to get those published for the GitHub PR is open.

When "make verify" is invoked manually or in any other job, the stricter check
is skipped. That works because "PR_NUMBER" is only set for pre-merge
jobs (https://github.com/kubernetes/test-infra/blob/master/prow/jobs.md#job-environment-variables).
Because the script now explicitly selects the configuration file, the files no
longer have to be in the root directory. Having them in hack without the
leading dot is better because they then have the same owners as the script and
are more visible.

The downside is that manual invocations of golangci-lint without the parameter
no longer work.
There's no good reason why error checking should be skipped in tests: if a
function fails, the test should stop immediately. If errors are acceptable,
then ignoring the error should be explicit and visible via "_ =
someFunction()".
e2e.test usually writes stdout and stderr to the same destination
(file or console). Therefore it makes no sense to print log messages with
severity error or higher to stdout and stderr (the default) because they would
just be shown twice. Therefore the e2e.test binary overrides the klog default
before flag parsing. Command line flags can still change this.
Contextual logging cannot be enabled manually because there is no feature gate
flag. Enabling the feature unconditionally:
- should be low risk for E2E testing
- if it fails, we want to know
- is useful to get better log output from code which already supports it
This complements the other operations and will be used by some
future tests.
This is useful for running a driver on a subset of all ready nodes:

- use e2enode.GetBoundedReadySchedulableNodes with a suitable
  maximum number of nodes to determine how much nodes are available
  for a test
- define pod anti-affinity in the PodTemplate:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: xxxxxxx
            topologyKey: kubernetes.io/hostname
- set the ReplicaSetSpec.Replicas value to the number of nodes
It's new code, so we should follow the latest recommendations for logging.
ResourceClaimSpec and its associated code are needed for the core API and have
to be defined there to avoid import cycles. Therefore the other types also get
defined there.
Created with "make generated_files update".
This is needed for "kubectl get". It depends on the generated swagger docs.
This is similar to the support code for generic ephemeral inline volumes.
Differences:
- to avoid stuttering, the functions are just resourceclaim.Name and
  resourceclaim.IsForPod
- resourceclaim.Name returns the right name for both cases (template
  and reference), which will simplify some code

Other helper functions check some simple status conditions.
The controller uses the exact same logic as the generic ephemeral inline volume
controller, just for inline ResourceClaimTemplate -> ResourceClaim.
The plugin handles the interaction with ResourceClaims that are referenced by a
Pod.
This is needed by the API server, but can also be used by driver developers to
validate user input early.
Dependencies need to be updated to use
github.com/container-orchestrated-devices/container-device-interface.

github.com/fsnotify/fsnotify v1.5.1 and sigs.k8s.io/yaml v1.3.0 are needed.
This call was implemented incorrectly and not actually needed as
DRA Manager doesn't checkpoint its cache due to an idempotent nature
of the DRA Plugin APIs. This call can also race with other parts of
the DRA manager implementation.

This should fix known issue: NodeUnprepareResource was not called
for the resources that were removed from the DRA manager cache by
the UpdatePodResources.
As we delete pod UID from DRA manager cache before checking how many
is left we should call NodeUnprepareResource only when there are no
pod UIDs left in the case for a certain resource claim.
Moved addPodUID and removePodUID APIs to the resource struct
as they're always called for the particular resource.

This should also fix incorrectly implemented deletePodUIs API that
was removing pod UID from all claims referencing it.
These helper packages implement the parts of a dynamic resource allocation
driver that are generic and can be used by the in-tree test driver as well as
out-of-tree vendor drivers.

Functional options make it possible to treat some parameters as
optional (logger) and extend the API later on.
The driver can be used manually against a cluster started with
local-up-cluster.sh and is also used for E2E testing. Because the tests proxy
connections from the nodes into the e2e.test binary and create/delete files via
the equivalent of "kubectl exec dd/rm", they can be run against arbitrary
clusters. Each test gets its own driver instance and resource class, therefore
they can run in parallel.
Used sets API to store UIDs of pods referencing resource claims
in the DRA Manager cache.
pohly and others added 16 commits September 1, 2022 08:51
In contrast to ginkgo.AfterEach, ginkgo.DeferCleanup runs the callback in
first-in-last-out order. Using it makes the following test code work as
expected:

    f := framework.NewDefaultFramework("some test")

    ginkgo.AfterEach(func() {
        // do something with f.ClientSet
    })

Previously, f.ClientSet was already set to nil by the framework's cleanup code.
By using DeferCleanup we can get rid of artificial nesting because it runs the
callbacks in first-in-last-out order.

While at it, tests that aren't specifically targeting kubelet are defined for a
"cluster" context where they use more than one node, if available. This is a
more realistic scenario.
kubelet isn't ready for it yet.
ginkgo.By can be used to record how long a callback runs. This particular
feature is not used much (or at all?) in Kubernetes and we don't have tooling
for analyzing the measurements.

What would be more important is text output. framework.By provides that:
- reminders that a step is still running at regular intervals,
  configurable with --slow-step-threshold
- the total duration for slow steps

Example:

    STEP: delete pods and claims 08/24/22 11:55:19.676
    STEP: deleting *v1.Pod dra-5514/tester-1 08/24/22 11:55:19.679
    STEP: deleting *v1.ResourceClaim dra-5514/tester-1-my-inline-claim 08/24/22 11:55:19.686
    STEP still running: delete pods and claims 08/24/22 11:55:29.677
    STEP still running: delete pods and claims 08/24/22 11:55:39.676
    STEP still running: delete pods and claims 08/24/22 11:55:49.677
    <additional output when the objects finally get removed>
    STEP duration: delete pods and claims 33.253509791s
We don't want to document in detail how to use local-up-cluster.sh, only show
the specific options that are needed for DRA. Figuring out which other options
may be needed is left as exercise to the reader...

Paths for files were stale.
gomega.Eventually with a function that takes a gomega.Gomega instance can be
used to poll for certain conditions. The advantage over custom code is that
gomega assertions can be used. However, gomega as it stands today does not
report why the polling continues. This is annoying when running a test
interactively: one has to wait for the timeout before additional information is
shown in the failure message.

SucceedEventually is a replacement for gomega.Eventually which is tailored
towards this use case and adds intermediate output. It's best combined with the
By replacement.

Example with some artificial failure:

    STEP: wait for plugin registration 08/24/22 14:17:23.488
    ...
    STEP still running: wait for plugin registration 08/24/22 14:17:33.489
    STEP still running: wait for plugin registration 08/24/22 14:17:43.489
    STEP still running: wait for plugin registration 08/24/22 14:17:53.489
    Aug 24 14:17:53.509: INFO: wait for plugin registration
    xxx
    Expected
        <bool>: false
    to equal
        <bool>: true
    STEP still running: wait for plugin registration 08/24/22 14:18:03.489
    STEP still running: wait for plugin registration 08/24/22 14:18:13.489
    STEP still running: wait for plugin registration 08/24/22 14:18:23.489
    Aug 24 14:18:23.489: FAIL: Timed out after 60.000s.
    wait for plugin registration
    Expected
        <string>: xxx
        Expected
            <bool>: false
        to equal
            <bool>: true
    to be empty

    Full Stack Trace
    k8s.io/kubernetes/test/e2e/dra.(*Driver).SetUp.func7()
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:195 +0xda
    k8s.io/kubernetes/test/e2e/framework.By({0x702cfda, 0x1c}, 0xc0029d3df8, {0x0, 0x0, 0x6fbc0cb?})
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/framework/log.go:115 +0x5ea
    k8s.io/kubernetes/test/e2e/dra.(*Driver).SetUp(0xc0006af700)
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:194 +0xa67
    k8s.io/kubernetes/test/e2e/dra.NewDriver.func1()
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:67 +0x26
If pod deletion hangs, then we want to see that. Otherwise we only get told
that claims aren't getting deleted, but that then isn't the root cause.
By letting the test pods run to completion right away we save time because each
process starts and stops as quickly as possible.
Having all processes exit immediately is not realistic and may hide issues that
occur for longer-running processes. Another solution with the same effect (fast
test execution) is to let kubelet kill each process quickly.
Moved call of the NodePrepareResources from the admission phase
to the container configuration phase.

This should enable retrying NodePrepareResources if DRA plugin
is not ready or previous call of the NodePrepareResources
was unsuccessful.
It's not decided yet whether we will implement Topology support
for DRA or not. Removing toppology-related code
will help to avoid wrong impression that DRA is used as a hint
provider for the Topology Manager.
DRA e2e test used force pod deletion for test and plugin pods.
This should not be a default way of deleting pods, but rather
an exception.
@bart0sh bart0sh force-pushed the PR015-dra-e2e-get-rid-of-force-deletion branch from 6df3c07 to 67a1752 Compare September 16, 2022 13:32
@pohly

pohly commented Sep 16, 2022

Copy link
Copy Markdown
Owner

The commit message is a bit misleading: "used force pod deletion for test and plugin pods."

The delete itself was not using force-deletion. That would have been a DeleteOption with explicit zero as grace period.

It was a bit more indirect: because of the termination grace period being zero, Kubernetes concluded that deletion should also be immediate. It's worth calling that out in the code comment and commit message. A link to https://github.com/kubernetes/kubernetes/blob/0f582f7c3f504e807550310d00f130cb5c18c0c3/pkg/registry/core/pod/strategy.go#L151-L171 would also be useful, for future reference.

@pohly

pohly commented Sep 16, 2022

Copy link
Copy Markdown
Owner

FWIW, comments in the code are more important than commit message at the moment. Eventually all "e2e dra" commits will be squashed into one, with a commit message that won't capture all details on how the test came into existence.

@bart0sh

bart0sh commented Sep 16, 2022

Copy link
Copy Markdown
Author

The delete itself with a grace period 0 is a force deletion in my understanding.
k8s users are also aware of that I think as kubectl doesn't allow to set grace period 0 without using --force.

@bart0sh

bart0sh commented Sep 16, 2022

Copy link
Copy Markdown
Author

Referirng to https://github.com/kubernetes/kubernetes/blob/0f582f7c3f504e807550310d00f130cb5c18c0c3/pkg/registry/core/pod/strategy.go#L151-L171 would make it harder to understand in my opinion. At least it did it for me until I read this user-friendly explanation: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced.

@pohly

pohly commented Sep 16, 2022

Copy link
Copy Markdown
Owner

The delete itself with a grace period 0 is a force deletion in my understanding.

DeleteOptions.GracePeriodSeconds is a pointer, therefore DeleteOptions{} is not setting GracePeriodSeconds to zero. A client has to explicitly choose force deletion, and that was not done here.

@pohly

pohly commented Sep 16, 2022

Copy link
Copy Markdown
Owner

Referirng to https://github.com/kubernetes/kubernetes/blob/0f582f7c3f504e807550310d00f130cb5c18c0c3/pkg/registry/core/pod/strategy.go#L151-L171 would make it harder to understand in my opinion. At least it did it for me until I read this user-friendly explanation: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced.

It's okay to refer to both, but the user-facing documentation does not cover the handling of Spec.TerminationGracePeriodSeconds, does it? That is only in the code.

@bart0sh

bart0sh commented Sep 16, 2022

Copy link
Copy Markdown
Author

My point was that we shouldn't give too many details on that. Everybody knows that grace period 0 for pod means its force deletion. The change is self-explanatory in my opinion - we don't want force deletion as a default way of deleting pods in our tests. I can change it to whatever you want if you insist, though.

@pohly

pohly commented Sep 19, 2022

Copy link
Copy Markdown
Owner

Everybody knows that grace period 0 for pod means its force deletion.

But that's the point: the cleanup code as written at the moment doesn't set grace period 0. It's completely unexpected that it force-deleted pods. It only became a force-delete because of some setting elsewhere, and that needs to be explained. Otherwise it remains unclear why DeleteOptions.GracePeriodSeconds gets set.

@pohly

pohly commented Sep 19, 2022

Copy link
Copy Markdown
Owner

A better solution might be to create pods with a 1 second deletion period and not do anything special in the cleanup code.

@pohly pohly force-pushed the dynamic-resource-allocation branch 2 times, most recently from dd47224 to df63205 Compare September 19, 2022 10:19
@bart0sh

bart0sh commented Sep 19, 2022

Copy link
Copy Markdown
Author

ok, closing then

@bart0sh bart0sh closed this Sep 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants