@liweinan
Created December 22, 2025 13:54
Test Case Manual Log
anan@think:~/works/openshift-versions/works$ cat install-config.yaml.bkup 
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Disabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Disabled 
  name: master
  platform: {}
  replicas: 3
metadata:
  name: weli-test
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-1
    vpc: {}
publish: External

# OCP-22663 - [ipi-on-aws] Pick instance types for machines per region basis

## Test Case Overview

This test case validates that the OpenShift installer correctly selects instance types for AWS machines based on regional availability. The installer uses a priority-based fallback mechanism to select the best available instance type for each region.

## Current Implementation Behavior

The installer uses the following instance type priority list for AMD64 architecture:
1. m6i.xlarge (primary preference)
2. m5.xlarge (fallback)
3. r5.xlarge (fallback)
4. c5.2xlarge (fallback)
5. m5.2xlarge (fallback)
6. c5d.2xlarge (fallback)
7. r5.2xlarge (fallback)

The installer automatically checks instance type availability in the selected region and availability zones, selecting the first available type from the priority list.
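The fallback mechanism can be sketched as a simple walk down the priority list. The `pickInstanceType` helper and the availability maps below are illustrative stand-ins, not the installer's real API (the actual logic lives in `PreferredInstanceType`, covered under Implementation Details):

```go
package main

import "fmt"

// pickInstanceType is a minimal sketch of the fallback mechanism: walk the
// priority list and return the first type offered in the region. If nothing
// matches, mirror the installer's behavior and fall back to the first entry.
func pickInstanceType(priority []string, available map[string]bool) string {
	for _, t := range priority {
		if available[t] {
			return t
		}
	}
	return priority[0]
}

func main() {
	priority := []string{"m6i.xlarge", "m5.xlarge", "r5.xlarge"}

	// Region where m6i is offered: the primary preference wins.
	fmt.Println(pickInstanceType(priority, map[string]bool{"m6i.xlarge": true, "m5.xlarge": true}))

	// Region without m6i: the first fallback wins.
	fmt.Println(pickInstanceType(priority, map[string]bool{"m5.xlarge": true, "r5.xlarge": true}))
}
```

Test Cases 1 and 2 below exercise exactly these two paths: a region where the primary preference is offered, and one where the installer must fall back.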

## Test Steps

### Test Case 1: Standard Region with m6i Available

Objective: Verify that the installer selects m6i.xlarge when it's available in the region.

Prerequisites:

  • AWS credentials configured
  • Access to a standard AWS region (e.g., us-east-1, us-west-2, ap-northeast-1, eu-west-1)

Steps:

1. Create the Install Config asset:

openshift-install create install-config --dir instance_types1

2. Modify the region field in install-config.yaml:

platform:
  aws:
    region: us-east-1  # or another region where m6i is available

3. Generate the Kubernetes manifests:

openshift-install create manifests --dir instance_types1

Expected Result:

  • The installer should select m6i.xlarge as the instance type
  • Verify the instance type in the generated manifests:
    grep -r instanceType: instance_types1/
  • Expected output should show:
    openshift/99_openshift-cluster-api_master-machines-0.yaml:      instanceType: m6i.xlarge
    

### Test Case 2: Region Where m6i is Not Available

Objective: Verify that the installer falls back to m5.xlarge when m6i is not available in the region.

Prerequisites:

  • AWS credentials configured
  • Access to a region where m6i instance types are not available (e.g., eu-north-1, eu-west-3, us-gov-east-1)

Steps:

1. Create the Install Config asset:

openshift-install create install-config --dir instance_types2

2. Modify the region field in install-config.yaml:

platform:
  aws:
    region: eu-west-3  # Region where m6i may not be available

3. Generate the Kubernetes manifests:

openshift-install create manifests --dir instance_types2

Expected Result:

  • The installer should detect that m6i.xlarge is not available and fall back to m5.xlarge
  • Verify the instance type in the generated manifests:
    grep -r instanceType: instance_types2/
  • Expected output should show:
    openshift/99_openshift-cluster-api_master-machines-0.yaml:      instanceType: m5.xlarge
    

### Test Case 3: Full Cluster Installation Verification

Objective: Verify that the selected instance type works correctly during actual cluster installation.

Prerequisites:

  • AWS credentials configured with sufficient permissions
  • Valid base domain and pull secret

Steps:

1. Use the install config from Test Case 1 or Test Case 2

2. Launch the cluster:

openshift-install create cluster --dir instance_types2

Expected Result:

  • Installation completes successfully
  • Master nodes are created with the expected instance type
  • Verify instance types of running instances:
    # After cluster installation, verify via AWS CLI or console
    aws ec2 describe-instances --filters "Name=tag:Name,Values=*master*" --query 'Reservations[*].Instances[*].[InstanceType,Tags[?Key==`Name`].Value|[0]]' --output table
  • Create a new project and deploy a test application to verify cluster functionality:
    oc new-project test-instance-types
    oc new-app --image=nginx --name=test-app
    oc get pods -w

## Additional Verification

### Verify Instance Type Selection Logic

To understand why a specific instance type was selected, check the installer logs:

# Enable debug logging
export OPENSHIFT_INSTALL_LOG_LEVEL=debug
openshift-install create manifests --dir instance_types1

Look for log messages related to instance type selection and availability checks.

### Manual Instance Type Availability Check

You can manually verify instance type availability in a region using AWS CLI:

# Check if m6i.xlarge is available in a specific region
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters "Name=instance-type,Values=m6i.xlarge" \
  --region us-east-1 \
  --query 'InstanceTypeOfferings[*].Location' \
  --output table

# Check if m5.xlarge is available
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters "Name=instance-type,Values=m5.xlarge" \
  --region eu-west-3 \
  --query 'InstanceTypeOfferings[*].Location' \
  --output table

## Notes

1. Instance Type Availability: Instance type availability can vary by region and availability zone. The installer automatically handles this by checking availability and selecting the best option.

2. Regional Overrides: If specific regions require different instance type priorities, they can be configured in pkg/types/aws/defaults/platform.go using the defaultMachineTypes map.

3. Architecture Support: This test case focuses on AMD64 architecture. ARM64 architecture uses different instance types (e.g., m6g.xlarge).

4. Version Compatibility:

  • For OpenShift 4.10 and later: Default instance type is m6i.xlarge, with fallback to m5.xlarge if m6i is not available
  • For OpenShift 4.6 to 4.9: Default instance type was m5.xlarge
  • For OpenShift 4.5 and earlier: Default instance type was m4.xlarge

## Implementation Details

This section explains how the instance type selection logic works in the codebase, including the key components and their interactions.

### 1. Instance Type Defaults Definition

Location: pkg/types/aws/defaults/platform.go

The InstanceTypes() function defines the default priority list of instance types based on architecture and topology:

// InstanceTypes returns a list of instance types, in decreasing priority order
func InstanceTypes(region string, arch types.Architecture, topology configv1.TopologyMode) []string {
    // Check for region-specific overrides first
    if classesForArch, ok := defaultMachineTypes[arch]; ok {
        if classes, ok := classesForArch[region]; ok {
            return classes
        }
    }

    instanceSize := defaultInstanceSizeHighAvailabilityTopology // "xlarge"
    // Single node topology requires larger instance (2xlarge) for 8 cores
    if topology == configv1.SingleReplicaTopologyMode {
        instanceSize = defaultInstanceSizeSingleReplicaTopology // "2xlarge"
    }

    switch arch {
    case types.ArchitectureARM64:
        return []string{
            fmt.Sprintf("m6g.%s", instanceSize),
        }
    default: // AMD64
        return []string{
            fmt.Sprintf("m6i.%s", instanceSize), // Primary: m6i.xlarge
            fmt.Sprintf("m5.%s", instanceSize),  // Fallback 1: m5.xlarge
            fmt.Sprintf("r5.%s", instanceSize),  // Fallback 2: r5.xlarge
            "c5.2xlarge",                         // Fallback 3
            "m5.2xlarge",                         // Fallback 4
            "c5d.2xlarge",                        // Fallback 5 (Local Zone compatible)
            "r5.2xlarge",                         // Fallback 6
        }
    }
}

Key Points:

  • Returns instance types in priority order (highest to lowest)
  • Supports region-specific overrides via defaultMachineTypes map
  • Adjusts instance size based on topology (HA vs single-node)
  • Different instance types for ARM64 vs AMD64 architectures
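The region-override lookup can be sketched with a simplified, self-contained version of the function. The map contents and the `example-region-1` region name are hypothetical; the real `defaultMachineTypes` map and default lists live in `pkg/types/aws/defaults/platform.go`:

```go
package main

import "fmt"

// Hypothetical override table: architecture -> region -> priority list.
// The real map in the installer may carry different regions and lists.
var defaultMachineTypes = map[string]map[string][]string{
	"amd64": {
		"example-region-1": {"m5.xlarge", "r5.xlarge"},
	},
}

// instanceTypes mirrors the lookup order shown above: a region-specific
// override wins; otherwise the architecture-wide default list is returned.
func instanceTypes(region, arch string) []string {
	if classesForArch, ok := defaultMachineTypes[arch]; ok {
		if classes, ok := classesForArch[region]; ok {
			return classes
		}
	}
	return []string{"m6i.xlarge", "m5.xlarge", "r5.xlarge"}
}

func main() {
	fmt.Println(instanceTypes("example-region-1", "amd64")) // override list
	fmt.Println(instanceTypes("us-east-1", "amd64"))        // default list
}
```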

### 2. Instance Type Selection Logic

Location: pkg/asset/machines/aws/instance_types.go

The PreferredInstanceType() function selects the best available instance type by checking availability in the specified zones:

// PreferredInstanceType returns a preferred instance type from the list of 
// instance types provided in descending order of preference
func PreferredInstanceType(ctx context.Context, meta *awsconfig.Metadata, 
    types []string, zones []string) (string, error) {
    if len(types) == 0 {
        return "", errors.New("at least one instance type required")
    }

    // Create EC2 client to query instance type availability
    client, err := awsconfig.NewEC2Client(ctx, awsconfig.EndpointOptions{
        Region:    meta.Region,
        Endpoints: meta.Services,
    })
    if err != nil {
        return "", fmt.Errorf("failed to create EC2 client: %w", err)
    }

    // Query AWS to get instance type availability per zone
    found, err := getInstanceTypeZoneInfo(ctx, client, types, zones)
    if err != nil {
        // If query fails, return first type as fallback
        return types[0], err
    }

    // Iterate through types in priority order
    for _, t := range types {
        // Check if this instance type is available in ALL required zones
        if found[t].HasAll(zones...) {
            return t, nil
        }
    }

    // If no type available in all zones, return first type with error
    return types[0], errors.New("no instance type found for the zone constraint")
}

The getInstanceTypeZoneInfo() function queries AWS EC2 API to check instance type availability:

func getInstanceTypeZoneInfo(ctx context.Context, client *ec2.Client, 
    types []string, zones []string) (map[string]sets.Set[string], error) {
    found := map[string]sets.Set[string]{}
    
    // Query AWS EC2 DescribeInstanceTypeOfferings API
    resp, err := client.DescribeInstanceTypeOfferings(ctx, &ec2.DescribeInstanceTypeOfferingsInput{
        Filters: []ec2types.Filter{
            {
                Name:   aws.String("location"),
                Values: zones,  // Filter by availability zones
            },
            {
                Name:   aws.String("instance-type"),
                Values: types,  // Filter by instance types
            },
        },
        LocationType: ec2types.LocationTypeAvailabilityZone,
    })
    if err != nil {
        return found, err
    }

    // Build a map: instance type -> set of available zones
    for _, offering := range resp.InstanceTypeOfferings {
        f, ok := found[string(offering.InstanceType)]
        if !ok {
            f = sets.New[string]()
            found[string(offering.InstanceType)] = f
        }
        f.Insert(aws.ToString(offering.Location))
    }
    return found, nil
}

Key Points:

  • Queries AWS EC2 API to check real-time instance type availability
  • Requires instance type to be available in ALL specified availability zones
  • Returns first available type from priority list
  • Falls back to first type if API query fails
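The "available in ALL zones" rule can be illustrated with a stripped-down version of the `HasAll` check, using a plain map in place of `sets.Set` (the zone data below is hypothetical):

```go
package main

import "fmt"

// hasAll mimics the sets.Set HasAll check used by PreferredInstanceType:
// an instance type qualifies only if it is offered in every required zone.
func hasAll(offered map[string]bool, zones []string) bool {
	for _, z := range zones {
		if !offered[z] {
			return false
		}
	}
	return true
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}

	// Missing from one zone: rejected even though mostly available.
	m6i := map[string]bool{"us-east-1a": true, "us-east-1b": true}
	// Offered in all three zones: qualifies.
	m5 := map[string]bool{"us-east-1a": true, "us-east-1b": true, "us-east-1c": true}

	fmt.Println(hasAll(m6i, zones)) // false
	fmt.Println(hasAll(m5, zones))  // true
}
```

This is why a type that is offered in most, but not all, of the selected availability zones is skipped in favor of the next entry in the priority list.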

### 3. Master Machine Configuration

Location: pkg/asset/machines/master.go

The master machine configuration integrates the instance type selection logic:

// When instance type is not specified by user
if mpool.InstanceType == "" {
    // Determine topology mode
    topology := configv1.HighlyAvailableTopologyMode
    if pool.Replicas != nil && *pool.Replicas == 1 {
        topology = configv1.SingleReplicaTopologyMode
    }
    
    // Get priority list of instance types
    instanceTypes := awsdefaults.InstanceTypes(
        installConfig.Config.Platform.AWS.Region,
        installConfig.Config.ControlPlane.Architecture,
        topology,
    )
    
    // Select best available instance type
    mpool.InstanceType, err = aws.PreferredInstanceType(
        ctx,
        installConfig.AWS,
        instanceTypes,
        mpool.Zones,
    )
    if err != nil {
        // If selection fails, use first type from list as fallback
        logrus.Warn(errors.Wrap(err, "failed to find default instance type"))
        mpool.InstanceType = instanceTypes[0]
    }
}

// Filter zones if instance type is not available in all default zones
if zoneDefaults {
    mpool.Zones, err = aws.FilterZonesBasedOnInstanceType(
        ctx,
        installConfig.AWS,
        mpool.InstanceType,
        mpool.Zones,
    )
    if err != nil {
        logrus.Warn(errors.Wrap(err, "failed to filter zone list"))
    }
}

Key Points:

  • Only runs when user hasn't specified an instance type
  • Determines topology (HA vs single-node) based on replica count
  • Calls InstanceTypes() to get priority list
  • Calls PreferredInstanceType() to select best available type
  • Filters zones if selected instance type isn't available in all zones

### 4. Machine Manifest Generation

Location: pkg/asset/machines/aws/machines.go

The Machines() function generates Kubernetes Machine manifests with the selected instance type:

// Machines returns a list of machines for a machinepool
func Machines(clusterID string, region string, subnets aws.SubnetsByZone, 
    pool *types.MachinePool, role, userDataSecret string, 
    userTags map[string]string, publicSubnet bool) ([]machineapi.Machine, 
    *machinev1.ControlPlaneMachineSet, error) {
    
    mpool := pool.Platform.AWS
    
    // Create machines for each replica
    for idx := int64(0); idx < total; idx++ {
        zone := mpool.Zones[int(idx)%len(mpool.Zones)]
        subnet, ok := subnets[zone]
        
        // Create provider config with selected instance type
        provider, err := provider(&machineProviderInput{
            clusterID:        clusterID,
            region:           region,
            subnet:           subnet.ID,
            instanceType:     mpool.InstanceType,  // Uses selected instance type
            osImage:          mpool.AMIID,
            zone:             zone,
            role:             role,
            // ... other fields
        })
        
        // Create Machine object
        machine := machineapi.Machine{
            Spec: machineapi.MachineSpec{
                ProviderSpec: machineapi.ProviderSpec{
                    Value: &runtime.RawExtension{Object: provider},
                },
            },
        }
        machines = append(machines, machine)
    }
    
    return machines, controlPlaneMachineSet, nil
}

The provider() function creates the AWS machine provider configuration:

func provider(in *machineProviderInput) (*machineapi.AWSMachineProviderConfig, error) {
    config := &machineapi.AWSMachineProviderConfig{
        TypeMeta: metav1.TypeMeta{
            APIVersion: "machine.openshift.io/v1beta1",
            Kind:       "AWSMachineProviderConfig",
        },
        InstanceType: in.instanceType,  // Set from selected instance type
        // ... other configuration fields
    }
    return config, nil
}

Key Points:

  • Generates Machine manifests for each replica
  • Uses the instance type selected by PreferredInstanceType()
  • Creates AWSMachineProviderConfig with the instance type
  • Distributes machines across availability zones
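The round-robin distribution from the loop above (`zone := mpool.Zones[int(idx)%len(mpool.Zones)]`) can be demonstrated in isolation; the zone names are illustrative:

```go
package main

import "fmt"

// zoneFor reproduces the zone assignment used in Machines():
// replica idx lands in zones[idx % len(zones)], wrapping around
// so machines are spread evenly across availability zones.
func zoneFor(idx int, zones []string) string {
	return zones[idx%len(zones)]
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	for idx := 0; idx < 5; idx++ {
		fmt.Printf("machine-%d -> %s\n", idx, zoneFor(idx, zones))
	}
}
```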

### Execution Flow Summary

1. User creates install-config → specifies region (and optionally instance type)
2. Master machine configuration (master.go):
   • If instance type is not specified, calls InstanceTypes() to get the priority list
   • Calls PreferredInstanceType() to select the best available type
3. Instance type selection (instance_types.go):
   • Queries the AWS EC2 API to check availability
   • Returns the first type available in all zones
4. Machine manifest generation (machines.go):
   • Creates Machine objects with the selected instance type
   • Writes manifests to disk

## Related Code References

  • Instance type defaults: pkg/types/aws/defaults/platform.go
  • Instance type selection logic: pkg/asset/machines/aws/instance_types.go
  • Machine manifest generation: pkg/asset/machines/aws/machines.go
  • Master machine configuration: pkg/asset/machines/master.go


# OCP-29648

anan@think:~/works/openshift-versions/works$ cat install-config.yaml.bkup 
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: qe.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      amiID: ami-01095d1967818437c
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      amiID: ami-0c1a8e216e46bb60c
  replicas: 3
metadata:
  name: weli-test
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-1
    vpc: {}
publish: External

# Inspect the master nodes' AMI (should show ami-0c1a8e216e46bb60c)
echo "Master node AMI:"
aws ec2 describe-instances \
  --region "${REGION}" \
  --filters "Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned" \
            "Name=tag:Name,Values=*master*" \
            "Name=instance-state-name,Values=running" \
  --output json | jq -r '.Reservations[].Instances[].ImageId' | sort | uniq

# Inspect the worker nodes' AMI (should show ami-01095d1967818437c)
echo "Worker node AMI:"
aws ec2 describe-instances \
  --region "${REGION}" \
  --filters "Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned" \
            "Name=tag:Name,Values=*worker*" \
            "Name=instance-state-name,Values=running" \
  --output json | jq -r '.Reservations[].Instances[].ImageId' | sort | uniq
Master node AMI:
ami-0c1a8e216e46bb60c
Worker node AMI:
ami-01095d1967818437c


# OCP-21531

Verify the Pull Secret:

anan@think:~/works/openshift-versions/421nightly$ vi ../auth.json
anan@think:~/works/openshift-versions/421nightly$ oc adm release extract --command openshift-install --from=registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804 -a ../auth.json 
anan@think:~/works/openshift-versions/421nightly$ du -h openshift-install 
654M	openshift-install

Export variables:

anan@think:~/works/openshift-versions/work3$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804
anan@think:~/works/openshift-versions/work3$ export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=ami-01095d1967818437c

Using different version installer to install the cluster:

anan@think:~/works/openshift-versions/work3$ ../421rc0/openshift-install version
../421rc0/openshift-install 4.21.0-rc.0
built from commit 8f88b34924c2267a2aa446dcdc6ccdd5260f9c45
release image quay.io/openshift-release-dev/ocp-release@sha256:ecde621d6f74aa1af4cd351f8b571ca2a61bbc32826e49cdf1b7fbff07f04ede
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Release Image Architecture is unknown 
release architecture unknown
default architecture amd64
anan@think:~/works/openshift-versions/work3$ ../421rc0/openshift-install create cluster
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Release Image Architecture is unknown 
INFO Credentials loaded from the "default" profile in file "/home/anan/.aws/credentials" 
WARNING Found override for OS Image. Please be warned, this is not advised 
INFO Successfully populated MCS CA cert information: root-ca 2035-12-23T03:35:54Z 2025-12-25T03:35:54Z 
INFO Successfully populated MCS TLS cert information: root-ca 2035-12-23T03:35:54Z 2025-12-25T03:35:54Z 
INFO Credentials loaded from the AWS config using "SharedConfigCredentials: /home/anan/.aws/credentials" provider 
WARNING Found override for release image (registry.ci.openshift.org/ocp/release:4.21.0-0.nightly-2025-12-22-170804). Please be warned, this is not advised 

Check the installed cluster version and the used amiID:

anan@think:~/works/openshift-versions/work3$ export KUBECONFIG=/home/anan/works/openshift-versions/work3/auth/kubeconfig
anan@think:~/works/openshift-versions/work3$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.21.0-0.nightly-2025-12-22-170804   True        False         71m     Cluster version is 4.21.0-0.nightly-2025-12-22-170804
$ oc get machineset.machine.openshift.io -n openshift-machine-api -o json | \
jq -r '.items[] | .spec.template.spec.providerSpec.value.ami.id'
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c
ami-01095d1967818437c


# OCP-22425

Cluster A:

anan@think:~/works/openshift-versions/work3$ oc get nodes
NAME                           STATUS   ROLES                  AGE   VERSION
ip-10-0-106-174.ec2.internal   Ready    control-plane,master   8h    v1.34.2
ip-10-0-157-14.ec2.internal    Ready    control-plane,master   8h    v1.34.2
ip-10-0-30-65.ec2.internal     Ready    worker                 8h    v1.34.2
ip-10-0-54-54.ec2.internal     Ready    worker                 8h    v1.34.2
ip-10-0-74-122.ec2.internal    Ready    worker                 8h    v1.34.2
ip-10-0-76-206.ec2.internal    Ready    control-plane,master   8h    v1.34.2
anan@think:~/works/openshift-versions/work3$ oc get route -n openshift-authentication
NAME              HOST/PORT                                                    PATH   SERVICES          PORT   TERMINATION            WILDCARD
oauth-openshift   oauth-openshift.apps.weli-test.qe.devcluster.openshift.com          oauth-openshift   6443   passthrough/Redirect   None
anan@think:~/works/openshift-versions/work3$ oc get po -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-6b767844c6-2jztv   2/2     Running   0          8h
apiserver-6b767844c6-g4rck   2/2     Running   0          8h
apiserver-6b767844c6-jzv4z   2/2     Running   0          8h
anan@think:~/works/openshift-versions/work3$ oc rsh -n openshift-apiserver apiserver-6b767844c6-2jztv
Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init)
sh-5.1# 

Cluster B:

anan@think:~/works/openshift-versions/works2$ oc get nodes
NAME                           STATUS   ROLES                  AGE   VERSION
ip-10-0-122-6.ec2.internal     Ready    control-plane,master   27m   v1.34.2
ip-10-0-134-89.ec2.internal    Ready    control-plane,master   27m   v1.34.2
ip-10-0-141-244.ec2.internal   Ready    worker                 13m   v1.34.2
ip-10-0-31-52.ec2.internal     Ready    worker                 21m   v1.34.2
ip-10-0-67-21.ec2.internal     Ready    control-plane,master   27m   v1.34.2
ip-10-0-96-196.ec2.internal    Ready    worker                 21m   v1.34.2
anan@think:~/works/openshift-versions/works2$ oc get po -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-574bdcd758-j85sh   2/2     Running   0          10m
apiserver-574bdcd758-l98ph   2/2     Running   0          10m
apiserver-574bdcd758-p922j   2/2     Running   0          8m8s
anan@think:~/works/openshift-versions/works2$ oc rsh -n openshift-apiserver apiserver-574bdcd758-j85sh
Defaulted container "openshift-apiserver" out of: openshift-apiserver, openshift-apiserver-check-endpoints, fix-audit-permissions (init)
sh-5.1# curl -k https://oauth-openshift.apps.weli-test.qe.devcluster.openshift.com/healthz
ok
sh-5.1#
