@jmchilton
Created February 13, 2026 12:46
CWL Validated Runtime Plan: Tools and Workflows

Problem

Galaxy's CWL integration (branch cwl-1.0 at common-workflow-lab/galaxy) translates CWL's schema-based parameter model into Galaxy's opinionated tool parameter model (basic.py). This round-trip — CWL schema to Galaxy widgets to user input to Galaxy param_dict to CWL job JSON — is the root of the problem. It requires:

  • Every CWL type mapped to a Galaxy widget type (TYPE_REPRESENTATIONS)
  • Union types encoded as Galaxy conditionals with synthetic _cwl__type_/_cwl__value_ keys
  • A catch-all FieldTypeToolParameter added to basic.py
  • Reverse-engineering Galaxy DatasetWrappers back into CWL File objects (to_cwl_job(), galactic_flavored_to_cwl_job() in representation.py)

This adaptation layer is one hack after another and deeply entangles CWL with Galaxy's core parameter infrastructure.

Target Pattern: YAML Tool Runtime

Galaxy's YAML-defined tools already demonstrate the architecture we want for CWL. They use a typed state transformation chain:

RequestToolState (API)
    → decode() → RequestInternalToolState
    → dereference() → RequestInternalDereferencedToolState
    → expand() → JobInternalToolState (persisted on job)
    → runtimeify() → JobRuntimeToolState (File objects with paths)

Each transition is schema-validated and unit-testable. The runtimeify() step converts dataset references ({src: "hda", id: N}) into CWL-style File objects with real paths — all driven by the tool's parameter model, not by reverse-engineering Galaxy wrappers.
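As a rough sketch of that conversion (all names here are stand-ins; the real implementation lives in the convert.py/runtime.py machinery referenced below), runtimeify() replaces dataset references with CWL-style File dicts while letting scalars pass through:

```python
# Hypothetical sketch of the runtimeify() dataset conversion described above.
# Real code lives in galaxy.tool_util.parameters.convert; names are stand-ins.
import os


def runtimeify_sketch(input_state, resolve_path):
    """Replace {src: "hda", id: N} refs with CWL-style File dicts."""
    runtime_state = {}
    for name, value in input_state.items():
        if isinstance(value, dict) and value.get("src") == "hda":
            path = resolve_path(value["id"])
            runtime_state[name] = {
                "class": "File",
                "path": path,
                "basename": os.path.basename(path),
            }
        else:
            runtime_state[name] = value  # scalars pass through unchanged
    return runtime_state


state = runtimeify_sketch(
    {"input_file": {"src": "hda", "id": 42}, "threshold": 3},
    resolve_path=lambda dataset_id: f"/galaxy/datasets/dataset_{dataset_id}.dat",
)
```

The key property is that the transformation is driven entirely by the parameter model and a path-resolution callback, so it can be unit-tested without a running Galaxy.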

YAML tools bypass Galaxy's legacy parameter parsing entirely: they set has_galaxy_inputs = False, use UserToolEvaluator, and build commands from validated state via JavaScript/CWL expressions.

Goal

Migrate CWL tool execution to follow this same pattern:

  1. Tool request API entry: CWL tools accept requests via POST /api/jobs with CWL-native parameters. Galaxy provides a client layer translating dataset references (not raw file paths) into the API.
  2. Validated tool state: The input flows through the typed state chain. JobInternalToolState is persisted on the job with dataset references.
  3. CWL-specific runtimeify: At evaluation time, convert JobInternalToolState into CWL job inputs (File objects with paths, secondary files, format URIs) — analogous to what YAML tools do but richer to satisfy cwltool's expectations.
  4. cwltool via JobProxy: Pass the runtimeified inputs to JobProxy, which delegates command building, file staging, and output collection to cwltool. This delegation is inherent to CWL — cwltool is the authoritative command builder — and differs structurally from YAML tools (which evaluate expressions directly).
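To make step 1 concrete, a hypothetical request payload might look like the following. Only the {src, id} reference convention comes from this document; the surrounding field names are illustrative, not the actual API schema:

```python
# Hypothetical shape of a CWL-native tool request for POST /api/jobs.
# Field names other than the {src, id} reference convention are illustrative.
request = {
    "tool_id": "cat-tool",  # hypothetical CWL tool id
    "history_id": "abc123",
    "inputs": {
        "input_file": {"src": "hda", "id": 42},  # dataset reference, not a raw path
        "count": 3,  # scalars pass through as-is
    },
}
```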

What This Eliminates

  • The representation.py layer (to_cwl_job, galactic_flavored_to_cwl_job, dataset_wrapper_to_file_json)
  • CWL-specific parameter types in basic.py (FieldTypeToolParameter, TYPE_REPRESENTATIONS)
  • Galaxy's parameter expansion/population machinery for CWL tools (expand_meta_parameters, _populate_async)
  • The to_cwl() fallback path in workflow/modules.py for CWL tool execution

Ideally, the CWL branch does not modify basic.py at all.

What Remains Different from YAML Tools

CWL tools will not be identical to YAML tools. Key structural differences:

  • Command pre-computation: CWL delegates command building to cwltool at exec_before_job time via JobProxy. YAML tools evaluate expressions at _build_command_line time.
  • Output collection: CWL uses a post-execution relocate_dynamic_outputs.py script because cwltool runs its own output glob evaluation. YAML tools use Galaxy's standard metadata.
  • File staging: CWL uses cwltool's PathMapper for symlinks (needed for InitialWorkDirRequirement). YAML tools use Galaxy's compute_environment.input_path_rewrite().
  • Evaluator class: CWL uses ToolEvaluator (not UserToolEvaluator) since commands are pre-computed, not built from expressions.

Open Questions (with preliminary research — not yet reviewed)

1. Where should dataset→File conversion happen for CWL?

Answer: Both phases — generic runtimeify then CWL-specific enrichment.

Currently exec_before_job receives validated_tool_state.input_state containing raw {src: "hda", id: N} references and passes them directly to JobProxy. But JobProxy._normalize_job() (parser.py:376-391) expects File objects with path or location keys — a structural mismatch.

The two-phase approach:

  1. Generic runtimeify() (convert.py:539-577) with setup_for_runtimeify() (runtime.py:50-123) converts raw dataset refs into DataInternalJson File objects with class, path, basename, format, size, etc. This is the same typed conversion YAML tools use — reusable, unit-testable.

  2. CWL-specific enrichment in exec_before_job adds what cwltool needs beyond basic File objects: secondaryFiles, CWL format URIs, checksums, and anything else JobProxy._normalize_job() expects.

The generic phase can be called at exec_before_job time since inp_data is available there. This follows the YAML pattern while keeping CWL-specific concerns isolated.
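The layering can be sketched as two composable callbacks, a generic one and a CWL wrapper around it (function names here are stand-ins for the real setup_for_runtimeify() callback and its CWL enrichment):

```python
# Sketch of the two-phase approach: a generic adapt step builds the basic
# File dict, then a CWL-specific wrapper enriches it without touching phase 1.


def generic_adapt_dataset(path, basename, size):
    # Phase 1: roughly what the shared runtimeify machinery produces.
    return {"class": "File", "path": path, "basename": basename, "size": size}


def cwl_enrich(file_dict, secondary_files=None, format_uri=None):
    # Phase 2: CWL-only additions layered on top.
    enriched = dict(file_dict)
    if secondary_files:
        enriched["secondaryFiles"] = secondary_files
    if format_uri:
        enriched["format"] = format_uri
    return enriched


base = generic_adapt_dataset("/data/reads.fastq", "reads.fastq", 1024)
full = cwl_enrich(base, format_uri="http://edamontology.org/format_1930")
```

Because enrichment never mutates the phase-1 result, the generic path stays reusable for YAML tools while CWL concerns stay isolated.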

2. How to handle secondary files in the new path?

Answer: Extend DataInternalJson and discover secondary files in adapt_dataset().

The current DataInternalJson (tool_util_models/parameters.py:594-612) has no secondaryFiles field — it's explicitly commented out. This is the key blocker.

Secondary files are stored in Galaxy's object store at {dataset.extra_files_path}/__secondary_files__/ with an ordering index at __secondary_files_index.json (util.py:41). The legacy dataset_wrapper_to_file_json() (representation.py:163-183) discovers them at execution time by enumerating this directory.
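The on-disk layout can be reproduced with a throwaway directory. The directory and index file names follow the convention quoted above; placing the index beside (rather than inside) the secondary files directory is an assumption here, so check util.py for the authoritative location:

```python
# Demonstrates the secondary-files layout described above in a temp directory.
import json
import os
import tempfile

extra_files_path = tempfile.mkdtemp()
sf_dir = os.path.join(extra_files_path, "__secondary_files__")
os.makedirs(sf_dir)
with open(os.path.join(sf_dir, "reads.fastq.idx"), "w") as fh:
    fh.write("index data")
# Ordering index; exact location relative to sf_dir is an assumption here.
with open(os.path.join(extra_files_path, "__secondary_files_index.json"), "w") as fh:
    json.dump({"order": ["reads.fastq.idx"]}, fh)

discovered = sorted(os.listdir(sf_dir))
```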

The fix:

  1. Add secondaryFiles: Optional[List["DataInternalJson"]] to DataInternalJson
  2. Enhance adapt_dataset() in setup_for_runtimeify() (runtime.py:77-94) to check the HDA's extra_files_path for __secondary_files__/, read the index, and attach them to the File object

This keeps secondary file logic centralized in the same callback that builds the primary File object, avoids duplication between YAML and CWL paths, and integrates naturally with cwltool's expectations (it processes secondaryFiles via visit_class() in _normalize_job()).

3. Can output collection move inside Galaxy's job finishing?

Answer: No — it must stay as an appended script. Several reasons:

  • cwltool needs the working directory: job_proxy.collect_outputs() (parser.py:495-506) invokes cwltool's native collect_outputs() which evaluates output glob patterns at runtime, directly on the filesystem where outputs were created. Galaxy's discover_outputs() (tools/__init__.py:2871-2916) uses pre-computed metadata — fundamentally different.

  • Compute node locality: The .cwl_job.json file and working directory live on the compute node. Moving collection to JobWrapper.finish() on the controller would require transferring both, plus having cwltool available on the controller.

  • CWL-specific metadata model: handle_outputs() (runtime_actions.py:69-230) handles secondaryFiles with custom index files, Directory outputs, CWL format URIs, and ExpressionTool JS execution — none of which Galaxy's discover_outputs() understands.

  • Container execution: The relocate script runs after the container exits but within the same job script, with working directory still accessible. This mirrors how Galaxy's standard metadata handling works outside containers (command_factory.py:163-165).

The appended script is the correct architectural choice. CWL output collection is fundamentally a cwltool responsibility requiring working directory context.

CWL Runtime Plan: Tools and Workflows

Plumb CWL tool execution — both direct invocation and workflow steps — through Galaxy's validated tool state chain and runtimeify infrastructure. Eliminate the representation.py adaptation layer, FieldTypeToolParameter, and all CWL-specific code in basic.py.

Branch: cwl_on_tool_request_api_2


Architecture

RequestToolState (API / workflow step)
    → decode() → RequestInternalToolState
    → dereference() → RequestInternalDereferencedToolState
    → expand() → JobInternalToolState (persisted on job)
    → runtimeify() → JobRuntimeToolState (CWL File objects with paths)
    → exec_before_job() → JobProxy → cwltool

Tool-invoked and workflow-invoked CWL tools converge at JobInternalToolState. The only difference is how that state is constructed:

  • Tool request API: Standard decode → dereference → expand chain
  • Workflow step: build_cwl_input_dict() constructs the dict directly from step connections, bypassing Galaxy's legacy parameter system entirely

Step 1: CWL-Specific Runtimeify Infrastructure

Build the CWL-enriched runtimeify callback and models that both tool and workflow paths need.

1a: CwlDataInternalJson model

Where: lib/galaxy/tool_util_models/parameters.py, after DataInternalJson

class CwlSecondaryFileJson(StrictModel):
    class_: Annotated[Literal["File", "Directory"], Field(alias="class")]
    path: str
    basename: str

class CwlDataInternalJson(DataInternalJson):
    """DataInternalJson extended with CWL-specific fields."""
    secondaryFiles: Optional[List[CwlSecondaryFileJson]] = None

CwlDataInternalJson extends DataInternalJson with optional secondaryFiles. Since the field is optional, existing validators accept it.

1b: CWL runtimeify setup

Where: lib/galaxy/tools/cwl_runtime.py (new file)

def setup_for_cwl_runtimeify(app, compute_environment, input_datasets, input_dataset_collections=None):
    """CWL-enriched version of setup_for_runtimeify."""
    hda_references, base_adapt_dataset, adapt_collection = setup_for_runtimeify(
        app, compute_environment, input_datasets, input_dataset_collections
    )

    hdas_by_id = {d.id: d for d in input_datasets.values() if d is not None}

    def adapt_dataset(value):
        base_result = base_adapt_dataset(value)
        hda = hdas_by_id.get(value.id)
        if hda is None:
            return base_result

        result_dict = base_result.model_dump(by_alias=True)

        # Secondary files
        secondary_files = discover_secondary_files(hda, compute_environment)
        if secondary_files:
            result_dict["secondaryFiles"] = secondary_files

        # CWL format URI (replace Galaxy extension with EDAM URI)
        if hasattr(hda, 'cwl_formats') and hda.cwl_formats:
            result_dict["format"] = str(hda.cwl_formats[0])

        return CwlDataInternalJson(**result_dict)

    return hda_references, adapt_dataset, adapt_collection

1c: discover_secondary_files

def discover_secondary_files(hda, compute_environment=None):
    """Discover secondary files from {extra_files_path}/__secondary_files__/."""
    extra_files_path = hda.extra_files_path
    secondary_files_dir = os.path.join(extra_files_path, SECONDARY_FILES_EXTRA_PREFIX)

    if not os.path.exists(secondary_files_dir):
        return []

    secondary_files = []
    for name in os.listdir(secondary_files_dir):
        sf_path = os.path.join(secondary_files_dir, name)
        real_path = os.path.realpath(sf_path)
        is_dir = os.path.isdir(real_path)

        entry = {
            "class": "Directory" if is_dir else "File",
            "path": compute_environment.input_path_rewrite(sf_path) if compute_environment else sf_path,
            "basename": name,
        }
        secondary_files.append(entry)

    return secondary_files

1d: Make runtimeify recognize CWL file parameters

The runtimeify visitor checks isinstance(parameter, DataParameterModel). CWL file parameters use CwlFileParameterModel / CwlDirectoryParameterModel which extend BaseGalaxyToolParameterModelDefinition, not DataParameterModel.

Fix: Extend the visitor's to_runtime_callback in convert.py to also check for CWL file/directory parameter types, or make those types inherit from DataParameterModel.

1e: Move raw_to_galaxy() to cwl_runtime.py

raw_to_galaxy() creates deferred HDAs from CWL File dicts. Currently in basic.py. Move it to cwl_runtime.py so basic.py cleanup is unblocked. The workflow path needs it for from_cwl() valueFrom expression results.


Step 2: Tool-Only Runtimeify

Wire runtimeify into ToolEvaluator.set_compute_environment() for CWL tools invoked via the tool request API.

2a: Call runtimeify in the evaluator

Where: evaluation.py, ToolEvaluator.set_compute_environment(), in the param_dict_style == "regular" branch between state reconstruction and execute_tool_hooks.

internal_tool_state = None
if job.tool_state:
    internal_tool_state = JobInternalToolState(job.tool_state)
    internal_tool_state.validate(self.tool, f"{self.tool.id} (job internal model)")

# Runtimeify for CWL tools
if internal_tool_state is not None and self.tool.tool_type in CWL_TOOL_TYPES:
    from galaxy.tool_util.parameters.convert import runtimeify
    from galaxy.tools.cwl_runtime import setup_for_cwl_runtimeify

    hda_references, adapt_dataset, adapt_collection = setup_for_cwl_runtimeify(
        self.app, compute_environment, inp_data, input_dataset_collections
    )
    internal_tool_state = runtimeify(
        internal_tool_state, self.tool, adapt_dataset, adapt_collection
    )

2b: exec_before_job receives runtimeified state

Where: CwlCommandBindingTool.exec_before_job() at tools/__init__.py

After runtimeify, .input_state contains CWL File objects:

# Before runtimeify:
{"input_file": {"src": "hda", "id": 42}}

# After runtimeify:
{"input_file": {
    "class": "File",
    "path": "/galaxy/datasets/000/dataset_42.dat",
    "basename": "reads.fastq",
    "format": "http://edamontology.org/format_1930",
    "size": 1048576,
    "secondaryFiles": [
        {"class": "File", "path": "...__secondary_files__/reads.fastq.idx", "basename": "reads.fastq.idx"}
    ]
}}

Widen the type signature: execute_tool_hooks() and exec_before_job() accept Optional[Union[JobInternalToolState, JobRuntimeToolState]]. Both expose .input_state.

2c: _normalize_job() is unchanged

JobProxy._normalize_job() still runs fill_in_defaults() (CWL defaults) and pathToLoc (convert path→location). Both are no-ops or safe on pre-runtimeified inputs. No changes needed.
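The "safe on pre-runtimeified inputs" claim can be illustrated with a stand-in for the path-to-location normalization (the real logic is cwltool's; this is only a sketch of why a second pass is a no-op):

```python
# Stand-in for cwltool's path->location normalization, illustrating why it
# is idempotent on inputs that already carry a location.
def path_to_loc_sketch(file_dict):
    if "location" not in file_dict and "path" in file_dict:
        file_dict["location"] = file_dict["path"]
    return file_dict


fresh = path_to_loc_sketch({"class": "File", "path": "/data/reads.fastq"})
again = path_to_loc_sketch(dict(fresh))  # second pass changes nothing
```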


Step 3: Workflow CWL Input Dict Construction

Build CWL input dicts directly from step connections, bypassing Galaxy's legacy parameter system (FieldTypeToolParameter, visit_input_values callback, etc.).

3a: build_cwl_input_dict()

Where: lib/galaxy/workflow/modules.py, new function

def build_cwl_input_dict(
    step: WorkflowStep,
    progress: WorkflowProgress,
    trans,
) -> dict[str, Any]:
    """Build a CWL-native input dict from step connections.

    Values are:
    - HDA inputs:  {"src": "hda", "id": N}
    - HDCA inputs: {"src": "hdca", "id": N}
    - Scalars:     raw values (int, str, float, bool, None)
    - expression.json HDAs: parsed JSON content (for non-data connections)
    """
    cwl_input_dict = {}

    for input_name, connections in step.input_connections_by_name.items():
        if len(connections) == 1:
            replacement = progress.replacement_for_connection(connections[0])
        else:
            replacement = progress.replacement_for_input_connections(
                step, _input_dict_for_name(input_name, step), connections
            )

        if isinstance(replacement, NoReplacement):
            continue

        cwl_input_dict[input_name] = _galaxy_to_cwl_ref(replacement)

    # Fill defaults from step inputs
    for step_input in step.inputs:
        name = step_input.name
        if name not in cwl_input_dict:
            if step_input.default_value is not None:
                cwl_input_dict[name] = _resolve_default(
                    step_input.default_value, trans, progress
                )

    return cwl_input_dict

3b: _galaxy_to_cwl_ref()

def _galaxy_to_cwl_ref(value):
    """Convert Galaxy model objects to CWL input dict references."""
    if isinstance(value, model.HistoryDatasetAssociation):
        if value.ext == "expression.json":
            with open(value.get_file_name()) as f:
                return json.load(f)
        return {"src": "hda", "id": value.id}
    elif isinstance(value, model.HistoryDatasetCollectionAssociation):
        return {"src": "hdca", "id": value.id}
    elif isinstance(value, model.DatasetCollectionElement):
        return {"src": "dce", "id": value.id}
    else:
        return value

expression.json handling: ExpressionTools produce expression.json datasets. When downstream steps consume these as scalars, parse the JSON content. If the downstream input is actually File-typed, we may need to check the CWL parameter model — fix if/when a test breaks.

3c: Wire into ToolModule.execute()

Where: ToolModule.execute() around line 2476

if tool.tool_type in CWL_TOOL_TYPES and not tool.has_galaxy_inputs:
    # Bypass legacy parameter system for CWL tools
    cwl_input_dict = build_cwl_input_dict(step, progress, trans)
    cwl_input_dict = evaluate_cwl_value_from_expressions(step, cwl_input_dict, progress, trans)

    internal_tool_state = JobInternalToolState(cwl_input_dict)
    internal_tool_state.validate(tool, f"{tool.id} (workflow step)")

    collections_to_match, collection_info = find_cwl_scatter_collections(step, cwl_input_dict, trans)

    if collections_to_match.has_collections():
        # Scatter: expand to multiple param combinations
        param_combinations, validated_param_combinations = expand_scatter(
            cwl_input_dict, collections_to_match, tool
        )
    else:
        collection_info = None
        param_combinations = [cwl_input_dict]
        validated_param_combinations = [internal_tool_state]

    mapping_params = MappingParameters(
        param_template=cwl_input_dict,
        param_combinations=param_combinations,
        validated_param_template=None,
        validated_param_combinations=validated_param_combinations,
    )

    execute(trans=trans, tool=tool, mapping_params=mapping_params, ...)
else:
    # Legacy Galaxy tool path
    visit_input_values(tool_inputs, execution_state.inputs, callback, ...)
    ...

Step 4: valueFrom Expression Evaluation

Evaluate CWL valueFrom JavaScript expressions against the CWL input dict.

4a: evaluate_cwl_value_from_expressions()

def evaluate_cwl_value_from_expressions(
    step: WorkflowStep,
    cwl_input_dict: dict,
    progress: WorkflowProgress,
    trans,
) -> dict[str, Any]:
    """Evaluate CWL valueFrom expressions. Modifies cwl_input_dict in place."""
    value_from_map = {}
    for step_input in step.inputs:
        if step_input.value_from:
            value_from_map[step_input.name] = step_input.value_from

    if not value_from_map:
        return cwl_input_dict

    # Convert refs to CWL format for JS evaluation
    hda_references = []
    step_state = {}
    for key, value in cwl_input_dict.items():
        step_state[key] = _ref_to_cwl(value, hda_references, trans, step)

    # Evaluate each valueFrom expression
    for key, value_from in value_from_map.items():
        context = step_state.get(key)
        result = do_eval(value_from, step_state, context=context)
        cwl_input_dict[key] = _cwl_result_to_ref(result, hda_references, progress, trans)

    return cwl_input_dict

4b: Helpers

def _ref_to_cwl(value, hda_references, trans, step):
    """Convert {src, id} ref to CWL format for JS expression evaluation."""
    if isinstance(value, dict) and "src" in value:
        if value["src"] == "hda":
            hda = trans.sa_session.get(model.HistoryDatasetAssociation, value["id"])
            return to_cwl(hda, hda_references=hda_references, step=step)
        elif value["src"] == "hdca":
            hdca = trans.sa_session.get(model.HistoryDatasetCollectionAssociation, value["id"])
            return to_cwl(hdca, hda_references=hda_references, step=step)
    return value

def _cwl_result_to_ref(value, hda_references, progress, trans):
    """Convert CWL expression result back to {src, id} ref."""
    result = from_cwl(value, hda_references=hda_references, progress=progress)
    if isinstance(result, model.HistoryDatasetAssociation):
        return {"src": "hda", "id": result.id}
    elif isinstance(result, model.HistoryDatasetCollectionAssociation):
        return {"src": "hdca", "id": result.id}
    return result

4c: when_expression adaptation

when expressions use the same conversion — build CWL format from input dict refs, evaluate the boolean expression. Use _ref_to_cwl() for conversion.
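A minimal sketch of the gating logic, with Python eval() standing in for cwltool's do_eval() (real CWL when expressions are JavaScript, e.g. inputs.count rather than inputs['count']; everything here is illustrative):

```python
# Stand-in for when-expression evaluation. The real path would call
# cwltool's do_eval() on a JavaScript expression against the CWL-formatted
# inputs; Python eval() is used here only to show the gating behavior.
def evaluate_when_sketch(when_expression, inputs):
    body = when_expression.strip()
    if body.startswith("$(") and body.endswith(")"):
        body = body[2:-1]  # strip the CWL "$(...)" wrapper
    return bool(eval(body, {"__builtins__": {}}, {"inputs": inputs}))


run_step = evaluate_when_sketch("$(inputs['count'] > 3)", {"count": 5})
skip_step = evaluate_when_sketch("$(inputs['count'] > 3)", {"count": 1})
```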


Step 5: Scatter / Collection Mapping

CWL scatter maps to Galaxy's implicit collection mapping.

5a: Identify scatter inputs

def find_cwl_scatter_collections(
    step: WorkflowStep,
    cwl_input_dict: dict,
    trans,
) -> tuple[CollectionsToMatch, Any]:
    collections_to_match = CollectionsToMatch()

    for step_input in step.inputs:
        name = step_input.name
        scatter_type = step_input.scatter_type or "dotproduct"

        if scatter_type == "disabled" or name not in cwl_input_dict:
            continue

        ref = cwl_input_dict[name]
        if not isinstance(ref, dict) or ref.get("src") not in ("hdca", "dce"):
            continue

        if ref["src"] == "hdca":
            hdca = trans.sa_session.get(model.HistoryDatasetCollectionAssociation, ref["id"])
            if hdca and hdca.collection.allow_implicit_mapping:
                collections_to_match.add(name, hdca)

    return collections_to_match, None  # collection_info (if any) is derived later, during matching

5b: Expand param_combinations for scatter

def expand_scatter(cwl_input_dict, collections_to_match, tool):
    matched_collections = dataset_collection_manager.match_collections(collections_to_match)

    param_combinations = []
    validated_param_combinations = []

    for iteration_elements in matched_collections.slice_collections():
        slice_dict = dict(cwl_input_dict)
        for name, element in iteration_elements.items():
            slice_dict[name] = _galaxy_to_cwl_ref(element.element_object)

        param_combinations.append(slice_dict)
        state = JobInternalToolState(slice_dict)
        state.validate(tool, f"{tool.id} (scatter slice)")
        validated_param_combinations.append(state)

    return param_combinations, validated_param_combinations

Start with single-variable dotproduct scatter. Multi-variable and nested scatter can come later — existing tests only cover single-variable.
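The single-variable case reduces to producing one parameter combination per collection element, which can be illustrated in pure Python (names are stand-ins for the expand_scatter() sketch above):

```python
# Pure-Python illustration of single-variable scatter: one parameter
# combination per collection element, everything else copied through.
def expand_single_scatter_sketch(input_dict, scatter_name, element_refs):
    combinations = []
    for ref in element_refs:
        slice_dict = dict(input_dict)
        slice_dict[scatter_name] = ref
        combinations.append(slice_dict)
    return combinations


combos = expand_single_scatter_sketch(
    {"reads": {"src": "hdca", "id": 7}, "min_len": 20},
    "reads",
    [{"src": "hda", "id": 101}, {"src": "hda", "id": 102}],
)
```

Multi-variable dotproduct would zip several such element lists together; nested scatter would recurse — both deferred per the note above.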


Step 6: Input/Output Dataset Associations

6a: Input dataset registration

The _execute() function creates JobToInputDatasetAssociation entries by walking the params dict looking for HDA objects. With the new path, params has {src: "hda", id: N} refs, not objects.

Fix: Add CWL-specific input association creation. After job creation, iterate over job.tool_state, find {src: "hda", id: N} refs, and create associations. Or pre-resolve refs to objects before calling execute.
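The first option can be sketched as a recursive walk over the persisted state (the helper name is illustrative, not existing Galaxy code):

```python
# Sketch of the proposed fix: walk a persisted tool_state dict and collect
# every {src: "hda", id: N} reference so input associations can be created.
def collect_hda_ids(value, found=None):
    if found is None:
        found = []
    if isinstance(value, dict):
        if value.get("src") == "hda" and "id" in value:
            found.append(value["id"])
        else:
            for child in value.values():
                collect_hda_ids(child, found)
    elif isinstance(value, list):
        for child in value:
            collect_hda_ids(child, found)
    return found


ids = collect_hda_ids(
    {"input_file": {"src": "hda", "id": 42},
     "pairs": [{"src": "hda", "id": 7}],
     "count": 3}
)
```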

6b: Output dataset creation

Created by DefaultToolAction.execute() based on tool.outputs. Independent of input format — should work unchanged.


Step 7: Remove Legacy CWL Parameter Infrastructure

7a: Remove exec_before_job fallback

# Remove this:
if validated_tool_state is not None:
    input_json = validated_tool_state.input_state
else:
    input_json = self.param_dict_to_cwl_inputs(param_dict, local_working_directory)

# Replace with:
assert validated_tool_state is not None, "CWL tools require validated_tool_state"
input_json = validated_tool_state.input_state

7b: Dead code removal

  • param_dict_to_cwl_inputs() on CWL tool classes
  • to_cwl_job(), galactic_flavored_to_cwl_job(), dataset_wrapper_to_file_json() in representation.py
  • TYPE_REPRESENTATIONS dict in representation.py
  • FieldTypeToolParameter in basic.py
  • CWL-specific Conditional/Repeat handling in representation.py
  • The visit_input_values callback's isinstance(input, FieldTypeToolParameter) checks in modules.py

7c: Clean basic.py

Remove FieldTypeToolParameter class, raw_to_galaxy() (now in cwl_runtime.py), CWL imports, TYPE_REPRESENTATIONS references.

Result: basic.py has zero CWL-specific code.


Data Flow Summary

Tool Request API Path

POST /api/jobs (CWL params)
    → decode/dereference/expand
    → JobInternalToolState persisted as job.tool_state
    → ToolEvaluator.set_compute_environment()
        → reconstruct JobInternalToolState from job.tool_state
        → runtimeify(state, tool, cwl_adapt_dataset, adapt_collection)
            → CwlDataInternalJson with secondaryFiles, EDAM format
        → JobRuntimeToolState
    → exec_before_job(validated_tool_state=runtime_state)
        → input_json = runtime_state.input_state
        → JobProxy(input_json, output_dict, job_dir)
        → _normalize_job(): fill_in_defaults, pathToLoc
        → cwltool Job → command_line
    → param_dict["__cwl_command"] → job script execution

Workflow Path

ToolModule.execute(step, progress)
    → build_cwl_input_dict(step, progress, trans)
        → resolve step connections to {src: "hda", id: N} refs
        → parse expression.json for scalar inputs
    → evaluate_cwl_value_from_expressions (if valueFrom)
    → find_cwl_scatter_collections + expand (if scatter)
    → JobInternalToolState(cwl_input_dict)
    → MappingParameters(validated_param_combinations=[state])
    → execute() → job.tool_state = state.input_state
    ─── converges with tool path ───
    → ToolEvaluator → runtimeify → exec_before_job → JobProxy → cwltool

Implementation Order

Step 1: CWL runtimeify infrastructure
    ├── 1a: CwlDataInternalJson + CwlSecondaryFileJson models
    ├── 1b: setup_for_cwl_runtimeify()
    ├── 1c: discover_secondary_files()
    ├── 1d: Make runtimeify recognize CWL file parameters
    └── 1e: Move raw_to_galaxy() to cwl_runtime.py
                    │
                    ▼
Step 2: Tool-only runtimeify
    ├── 2a: Wire runtimeify into ToolEvaluator
    ├── 2b: Widen exec_before_job type signature
    └── 2c: Verify _normalize_job unchanged
             │
             │  ← Tool-only CWL tests should pass here
             ▼
Step 3: Workflow input dict construction
    ├── 3a: build_cwl_input_dict()
    ├── 3b: _galaxy_to_cwl_ref()
    └── 3c: Wire into ToolModule.execute()
                    │
                    ▼
Step 4: valueFrom expressions
    ├── 4a: evaluate_cwl_value_from_expressions()
    ├── 4b: _ref_to_cwl(), _cwl_result_to_ref()
    └── 4c: when_expression adaptation
                    │
                    ▼
Step 5: Scatter support
    ├── 5a: find_cwl_scatter_collections()
    └── 5b: expand_scatter()
             │
             │  ← Workflow CWL tests should pass here
             ▼
Step 6: Input/output associations
    ├── 6a: Input dataset registration fix
    └── 6b: Verify output creation unchanged
                    │
                    ▼
Step 7: Remove legacy + clean basic.py
    ├── 7a: Remove exec_before_job fallback
    ├── 7b: Dead code removal
    └── 7c: Clean basic.py

Steps 1-2 get tool-only CWL working through the new path. Steps 3-5 get workflow CWL working. Step 6 handles any association issues that surface during testing. Step 7 is cleanup once everything works.

Steps 4-5 can be deferred if simple workflow tests pass without valueFrom/scatter. Just stub them out and add when tests need them.


New Files

  • lib/galaxy/tools/cwl_runtime.py: setup_for_cwl_runtimeify(), discover_secondary_files(), raw_to_galaxy() (moved from basic.py)

Modified Files

  • lib/galaxy/tool_util_models/parameters.py: Add CwlSecondaryFileJson, CwlDataInternalJson
  • lib/galaxy/tool_util/parameters/convert.py: Extend to_runtime_callback for CWL parameter types
  • lib/galaxy/tools/evaluation.py: Add runtimeify call in set_compute_environment()
  • lib/galaxy/tools/__init__.py: Widen exec_before_job type signature; eventually remove fallback
  • lib/galaxy/workflow/modules.py: Add build_cwl_input_dict(), CWL branch in ToolModule.execute(), expression helpers
  • lib/galaxy/tools/parameters/basic.py: Remove FieldTypeToolParameter, raw_to_galaxy() (Step 7)
  • lib/galaxy/tools/cwl/representation.py: Dead code removal (Step 7)

Testing Strategy

Unit Tests

  1. discover_secondary_files(): Mock HDA with extra_files_path/__secondary_files__/. Verify File/Directory classification and path resolution.
  2. CWL adapt_dataset: Given HDA with secondary files and CWL format, verify CwlDataInternalJson output.
  3. runtimeify with CWL parameters: CWL parameter model + JobInternalToolState with HDA refs → verify File objects.
  4. build_cwl_input_dict(): Mock step connections and progress. Verify {src, id} refs for HDAs, scalars for params, parsed JSON for expression.json.
  5. evaluate_cwl_value_from_expressions(): Input dict with HDA refs + valueFrom → verify expression sees CWL Files and results convert back.
  6. Scatter expansion: Input dict with HDCA ref + scatter → verify one param combination per element.

Integration Tests

All tests in test_workflows_cwl.py:

  • test_simplest_wf: Single-step, File I/O
  • test_load_ids: Multi-step subworkflow
  • test_count_line1_v1: Two-step + ExpressionTool
  • test_count_line1_v1_json: JSON job input
  • test_count_line2_v1: Different wiring
  • test_count_lines3_v1: Collection input → scatter → expression.json
  • test_count_lines4_v1: Multi-input + collection output
  • test_count_lines4_json: JSON job input
  • test_scatter_wf1_v1: Explicit CWL scatter

Test Gaps

  • valueFrom expressions in workflow steps
  • Secondary files between workflow steps
  • when expressions
  • Subworkflow execution

Unresolved Questions

  1. CWL parameters as DataParameterModel? CwlFileParameterModel extends BaseGalaxyToolParameterModelDefinition, not DataParameterModel. The runtimeify visitor won't recognize CWL file params without a fix (Step 1d). Best approach — extend visitor, or make CWL types inherit DataParameterModel?

  2. Directory inputs? CWL directories stored as tar archives (ext directory). Legacy code untars into _inputs dir. Where does untar happen in new path? Needs class: "Directory" + listing, not class: "File".

  3. Collection inputs (non-scatter)? {src: "hdca", id: N} through runtimeify — adapt_collection has NotImplementedError cases. Do CWL tools receive non-scatter collection inputs?

  4. Format stripping still needed? ToolProxy.__init__ strips format from CWL schema (parser.py:143-148). If we now provide EDAM format URIs, does cwltool validate them against the stripped schema? Mismatch risk.

  5. compute_environment for secondary file paths? Does input_path_rewrite() work on extra_files_path subdirectories or only on HDA file names?

  6. Expression tools in runtimeify? ExpressionTools have no command line (return ["true"]). Runtimeify still runs but the result flows to JobProxy.save_job() for output collection. Verify this path.

  7. Symlink staging for secondary files? Legacy code symlinks primary + secondary into _inputs dir for basename-relative refs. With runtimeify, paths point to extra_files_path/__secondary_files__/. Does cwltool's PathMapper.stage_files() handle this, or do we need explicit symlinking?

  8. input_dataset_collections wiring? set_compute_environment needs input_dataset_collections. job.io_dicts() returns (inp_data, out_data, out_collections) but not input collections directly. How does job.input_dataset_collections map to the format setup_for_runtimeify expects?

  9. CwlFileParameterModel.pydantic_template() acceptance? Does it accept {src: "hda", id: N} for job_internal state representation? py_type = DataRequest. Verify DataRequest matches the dict format.

  10. How does _execute() create JobToInputDatasetAssociation? Does it walk params for HDA objects, or use validated_param_combination? If the former, CWL input dict with {src, id} refs won't work without a fix.

  11. MappingParameters.validated_param_template — does execute() require it to be non-None, or does only validated_param_combinations matter?

CWL Legacy Branch Deep Dive

Research document covering the CWL integration branch (cwl-1.0 rebased onto Galaxy dev) and the recent WIP commits migrating to the tool request API. Written from code analysis of branch cwl_on_tool_request_api_2.

Table of Contents

  1. Branch Structure
  2. Architecture Overview
  3. Key Directories and Files
  4. Tool Loading and Proxy Layer
  5. Parameter Handling (The Hack Layer)
  6. Tool Execution Flow
  7. Output Collection
  8. Workflow Integration
  9. Test Infrastructure
  10. The 4 New Commits (Tool Request API Migration)
  11. What's NOT Covered

Branch Structure

The branch has ~52 commits: ~48 from the legacy cwl-1.0 branch (rebased onto Galaxy dev post release_26.0) plus 4 new WIP commits that begin migrating CWL tool execution to the modern tool request API.

Legacy commits (~48): Implement CWL tool/workflow parsing, parameter translation, execution via cwltool, conformance test infrastructure, output collection, and many bug fixes.

New commits (4):

d2f9a20b36  WIP: by-pass legacy Galaxy parameter handling for CWL tools
d968749217  Type error...
d4d68d2a9b  Fix persisting CWL tools for tool requests
c290f52d83  WIP: migrate CWL tool running to tool request API

Architecture Overview

The CWL integration wraps the reference CWL runner (cwltool) inside Galaxy's tool framework. The core design pattern is a proxy layer that adapts CWL concepts to Galaxy concepts:

CWL Tool Description (.cwl file)
    ↓ cwltool parses
ToolProxy (wraps cwltool.process.Process)
    ↓ adapted to
Galaxy Tool (CwlTool/GalacticCwlTool extends Tool)
    ↓ parameters adapted via
Galaxy Parameter System (basic.py FieldTypeToolParameter, conditionals, repeats)
    ↓ reverse-converted at execution via
to_cwl_job() / galactic_flavored_to_cwl_job()
    ↓ fed to
JobProxy (wraps cwltool.job.Job)
    ↓ extracts
Shell command + environment + file staging
    ↓ executed by
Galaxy job runner (standard execution pipeline)
    ↓ outputs collected by
handle_outputs() → relocate_dynamic_outputs.py

The fundamental problem (from PROBLEM_AND_GOAL.md): CWL has flexible, schema-based parameters. Galaxy has opinionated, inflexible tool parameters. The legacy branch adapted CWL → Galaxy parameters → back to CWL, which required extensive hacking. The new commits bypass this round-trip.


Key Directories and Files

Primary CWL Implementation: lib/galaxy/tool_util/cwl/

File Lines Purpose
__init__.py 21 Public API exports: tool_proxy, workflow_proxy, to_cwl_job, to_galaxy_parameters, handle_outputs
cwltool_deps.py 148 Optional dependency wrapper for cwltool/schema_salad/ruamel.yaml imports
schema.py 110 SchemaLoader class — loads CWL documents via cwltool's loading pipeline
parser.py 1263 Core module: ToolProxy, JobProxy, WorkflowProxy and all step/input proxy classes
representation.py 589 Galaxy↔CWL parameter mapping: to_cwl_job(), to_galaxy_parameters(), galactic_flavored_to_cwl_job()
util.py 720 Client-side utilities: galactic_job_json() (API→CWL), output_to_cwl_json() (Galaxy→CWL), file upload targets
runtime_actions.py 232 Post-execution output collection: handle_outputs()
runnable.py 33 Lightweight output discovery for CWL artifacts

Galaxy Tool Classes: lib/galaxy/tools/__init__.py

CWL tool hierarchy (around line 3754):

Tool (base)
  └── CwlCommandBindingTool    # Abstract base for CWL tools
        ├── CwlTool             # tool_type = "cwl"
        └── GalacticCwlTool     # tool_type = "galactic_cwl"

Also relevant: ExpressionTool — Galaxy's expression tool, separate from CWL expressions.

Parameter Hack: lib/galaxy/tools/parameters/basic.py

  • FieldTypeToolParameter (line ~2907) — CWL "field" type, the catch-all parameter that handles CWL's flexible typing within Galaxy's parameter system.

CWL Tool Parser: lib/galaxy/tool_util/parser/cwl.py

  • CwlToolSource — Implements Galaxy's ToolSource interface for CWL tools
  • CwlPageSource, CwlInputSource — Adapts CWL inputs to Galaxy's input page model

Execution: lib/galaxy/tools/evaluation.py

  • ToolEvaluator — CWL-specific branches for command line building, config files, environment variables

External Entry Point: lib/galaxy_ext/cwl/handle_outputs.py

  • relocate_dynamic_outputs() — Called by job scripts post-execution to collect CWL outputs

Tool Loading and Proxy Layer

Schema Loading (schema.py)

Two global SchemaLoader instances:

  • schema_loader — strict, validating (for tool loading)
  • non_strict_non_validating_schema_loader — lenient (for job execution)

Loading pipeline:

SchemaLoader.raw_process_reference(path)
    → cwltool.load_tool.fetch_document()
    → RawProcessReference(loading_context, process_object, uri)
        ↓
SchemaLoader.process_definition(raw_ref)
    → cwltool.load_tool.resolve_and_validate_document()
    → ResolvedProcessDefinition
        ↓
SchemaLoader.tool(process_def)
    → cwltool.load_tool.make_tool()
    → cwltool.process.Process

Tool Proxy Hierarchy (parser.py)

ToolProxy (abstract)
├── CommandLineToolProxy  (_class = "CommandLineTool")
└── ExpressionToolProxy   (_class = "ExpressionTool")

ToolProxy wraps a cwltool.process.Process and provides:

  • input_fields() — CWL input record schema fields
  • input_instances() → list of InputInstance (Galaxy-adapted input metadata)
  • output_instances() → list of OutputInstance
  • job_proxy(input_dict, output_dict, job_directory)JobProxy
  • to_persistent_representation() / from_persistent_representation() — serialization for DB storage
  • requirements — extracts CWL requirements/hints
  • docker_identifier() — extracts DockerRequirement image

Key hack: _hack_cwl_requirements() moves DockerRequirement from requirements to hints so Galaxy's own container system handles containerization instead of cwltool.

InputInstance (Simplified by New Commits)

Before new commits: InputInstance had input_type, collection_type, array, area attributes and a complex to_dict() producing Galaxy form widgets with conditionals and selects.

After new commits: InputInstance is stripped to just name, label, description. The function _outer_field_to_input_instance() no longer maps CWL types to Galaxy widget types.

Persistent Representation

Tools serialize to JSON for database storage:

{
  "class": "CommandLineTool",
  "raw_process_reference": { /* full CWL document */ },
  "tool_id": "tool_name",
  "uuid": "uuid-string"
}
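The round-trip can be illustrated with plain JSON. This is a hypothetical sketch: the dict shape follows the example above, but the helper names here only mimic `to_persistent_representation()`/`from_persistent_representation()`, whose real implementations live in parser.py.

```python
import json

# Hypothetical stand-ins for ToolProxy's serialization methods; the dict
# shape matches the persistent representation shown above.
def to_persistent_representation(tool):
    return json.dumps({
        "class": tool["class"],
        "raw_process_reference": tool["raw_process_reference"],
        "tool_id": tool["tool_id"],
        "uuid": tool["uuid"],
    })

def from_persistent_representation(serialized):
    return json.loads(serialized)

tool = {
    "class": "CommandLineTool",
    "raw_process_reference": {"cwlVersion": "v1.0", "baseCommand": "echo"},
    "tool_id": "tool_name",
    "uuid": "0f1c2d3e-aaaa-bbbb-cccc-0123456789ab",
}
restored = from_persistent_representation(to_persistent_representation(tool))
# tool_id and uuid survive the round-trip, which is what the unit tests
# added in commit d4d68d2a9b verify for the real ToolProxy.
assert restored == tool
```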

Dynamic Tool Registration

CWL tools are registered as "dynamic tools" in Galaxy. In tools/__init__.py (line ~676), when loading a dynamic tool with a CWL proxy:

tool_source = CwlToolSource(tool_proxy=dynamic_tool.proxy)

For workflow-embedded tools, tools are created from persistent representations stored in the database.


Parameter Handling (The Hack Layer)

This is the part the migration aims to eliminate. The legacy approach:

CWL → Galaxy Parameter Mapping (representation.py)

CWL type system is mapped to Galaxy's parameter types:

CWL Type Galaxy Representation Galaxy Widget
File DATA DataToolParameter
Directory DATA DataToolParameter
string TEXT TextToolParameter
int/long INTEGER IntegerToolParameter
float/double FLOAT FloatToolParameter
boolean BOOLEAN BooleanToolParameter
array DATA_COLLECTION (list) DataCollectionToolParameter
record DATA_COLLECTION (record) DataCollectionToolParameter
enum TEXT or SELECT SelectToolParameter
Any/union FIELD or CONDITIONAL FieldTypeToolParameter or Conditional
null (no param)

Union types are the worst offender — a CWL input like [null, File, int] gets mapped to a Galaxy conditional with a select dropdown (_cwl__type_) to pick the active type, and a nested value input (_cwl__value_).
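As a concrete illustration of that encoding: the `_cwl__type_`/`_cwl__value_` key names come from the branch, while the decoder below is a hypothetical helper showing how the discriminator would be read back out of the conditional state.

```python
# Hypothetical decoder for the conditional encoding described above.
def decode_union_state(state):
    """Turn a Galaxy conditional state dict back into a plain CWL value."""
    selected = state["_cwl__type_"]  # the select dropdown's discriminator
    if selected == "null":
        return None
    return state["_cwl__value_"]     # the nested value input

# A CWL input typed [null, File, int] becomes one of:
file_state = {"_cwl__type_": "File",
              "_cwl__value_": {"class": "File", "location": "/data/input.txt"}}
int_state = {"_cwl__type_": "int", "_cwl__value_": 42}
null_state = {"_cwl__type_": "null"}

assert decode_union_state(int_state) == 42
assert decode_union_state(null_state) is None
assert decode_union_state(file_state)["class"] == "File"
```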

FieldTypeToolParameter (basic.py:2907)

The field type is the catch-all CWL parameter in Galaxy's parameter system:

class FieldTypeToolParameter(ToolParameter):
    def from_json(self, value, trans, other_values=None):
        # Handles: None, dicts with "src", File class dicts, raw values
        if value.get("class") == "File":
            return raw_to_galaxy(trans.app, trans.history, value)
        return self.to_python(value, trans.app)

    def to_json(self, value, app, use_security):
        # Serializes: None, dicts with src/id, File class dicts, raw values

This parameter type handles the kitchen-sink nature of CWL inputs but is inherently hacky — it's trying to encode arbitrary structured data within Galaxy's parameter framework.

Reverse Conversion: Galaxy → CWL (representation.py)

to_cwl_job(tool, param_dict, local_working_directory) — The main legacy conversion path:

  1. Walks tool.inputs (Galaxy's parsed parameter tree)
  2. For repeat inputs → builds CWL arrays
  3. For conditional inputs → reads _cwl__type_ discriminator, extracts _cwl__value_
  4. For data inputs → calls dataset_wrapper_to_file_json() which creates CWL File objects with location, size, checksum, secondary files
  5. For data_collection → collection_wrapper_to_array() or collection_wrapper_to_record()
  6. For primitives → direct type conversion
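Step 4's File construction can be sketched in miniature. This assumes only the standard CWL File fields (location, basename, size, checksum with the spec's sha1$ prefix); the real dataset_wrapper_to_file_json() additionally handles DatasetWrapper internals and secondary files.

```python
import hashlib
import os
import tempfile

# Minimal sketch of building a CWL File object from a path, assuming only
# the File fields named in the CWL spec. Not the real implementation.
def file_to_cwl_json(path):
    with open(path, "rb") as fh:
        digest = hashlib.sha1(fh.read()).hexdigest()
    return {
        "class": "File",
        "location": "file://" + os.path.abspath(path),
        "basename": os.path.basename(path),
        "size": os.path.getsize(path),
        "checksum": f"sha1${digest}",  # CWL checksums use the sha1$<hex> form
    }

with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"hello\n")
file_obj = file_to_cwl_json(tmp.name)
assert file_obj["size"] == 6
assert file_obj["checksum"] == "sha1$f572d396fae9206628714fb2ce00f72e94f2258f"
```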

galactic_flavored_to_cwl_job(tool, param_dict, local_working_directory) — Simpler variant for tools with gx:Interface extensions.

to_galaxy_parameters(tool, as_dict) — Reverse: CWL job outputs → Galaxy tool state. Used when workflow steps receive CWL results.

The Problem

The round-trip (CWL schema → Galaxy widgets → user input → Galaxy param_dict → CWL job JSON) loses type fidelity, requires extensive special-casing, and touches basic.py which is core Galaxy infrastructure. Every CWL type needs Galaxy UI representation, serialization/deserialization, and reverse-mapping logic.


Tool Execution Flow

Legacy Path (pre-migration)

POST /api/tools  (run_tool_raw)
    ↓
Tool.execute() → exec_before_job()
    ↓
CwlCommandBindingTool.exec_before_job(param_dict):
    1. param_dict_to_cwl_inputs(param_dict)  ← reverse-engineers CWL from Galaxy params
    2. Creates JobProxy(input_json, output_dict, job_dir)
    3. Extracts: command_line, stdin, stdout, stderr, environment
    4. cwl_job_proxy.stage_files()  ← symlinks input files
    5. cwl_job_proxy.save_job()     ← writes .cwl_job.json
    6. Stores in param_dict:
        __cwl_command = "shell command string"
        __cwl_command_state = {args, stdin, stdout, stderr, env}
    ↓
ToolEvaluator builds command:
    if tool_type in CWL_TOOL_TYPES and "__cwl_command" in param_dict:
        command_line = param_dict["__cwl_command"]  # pre-generated, no Cheetah
    ↓
Job script runs on compute node:
    1. Execute __cwl_command
    2. python relocate_dynamic_outputs.py
    ↓
handle_outputs(job_directory):
    1. Load JobProxy from .cwl_job.json
    2. job_proxy.collect_outputs()  ← cwltool collects outputs
    3. Move files/directories to Galaxy dataset paths
    4. Write galaxy.json metadata

New Path (tool request API)

POST /api/jobs  (tool_request_raw)
    ↓
Tool.handle_input_async() with has_galaxy_inputs=False:
    - SKIPS expand_meta_parameters_async()
    - SKIPS _populate_async()
    - Passes raw input state through as-is
    ↓
Celery task (serializes tool via persistent representation)
    ↓
CwlCommandBindingTool.exec_before_job(validated_tool_state):
    input_json = validated_tool_state.input_state  ← direct, no reverse-engineering
    ... rest same as legacy

JobProxy Internals (parser.py:329-569)

JobProxy wraps cwltool's job execution:

class JobProxy:
    def __init__(self, tool_proxy, input_dict, output_dict, job_directory):
        self._tool_proxy = tool_proxy
        self._input_dict = input_dict      # CWL-format job inputs
        self._output_dict = output_dict    # Maps output names → dataset paths
        self._job_directory = job_directory

_normalize_job(): Fills CWL defaults via process.fill_in_defaults(), converts "path" → "location".

_ensure_cwl_job_initialized(): Creates cwltool RuntimeContext with:

  • outdir = {job_dir}/working
  • tmpdir = {job_dir}/cwltmp
  • stagedir = {job_dir}/cwlstagedir
  • use_container=False (Galaxy handles containers)

Then calls cwl_tool.job() to get the cwltool.job.Job object.

Key properties:

  • command_line — list of shell fragments (replaces GALAXY_SLOTS sentinel)
  • stdin, stdout, stderr — I/O redirection
  • environment — env vars dict
  • generate_files — InitialWorkDirRequirement files

stage_files(): Uses cwltool's PathMapper to create symlinks for input files.

collect_outputs(tool_working_directory, rcode): Calls cwl_job.collect_outputs() for CommandLineTools, or executes JavaScript for ExpressionTools.

ToolEvaluator CWL Specifics (evaluation.py)

CWL_TOOL_TYPES = ("galactic_cwl", "cwl")
  • Command line (line 808): Uses pre-generated __cwl_command instead of Cheetah template
  • Config files (line 849): Returns empty list (CWL tools don't use config files)
  • Environment variables (line 873): Reads from __cwl_command_state["env"]
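The environment-variable branch can be sketched as follows; the `__cwl_command_state["env"]` shape comes from the document, while the export-line helper is hypothetical.

```python
import shlex

# Hypothetical sketch: turn the env dict stashed by exec_before_job into
# shell-safe export statements, as the evaluator's environment-variable
# branch conceptually does.
def env_exports(cwl_command_state):
    return [f"export {name}={shlex.quote(value)}"
            for name, value in sorted(cwl_command_state.get("env", {}).items())]

state = {"env": {"TMPDIR": "/job/cwltmp", "HOME": "/job/home dir"}}
assert env_exports(state) == ["export HOME='/job/home dir'",
                              "export TMPDIR=/job/cwltmp"]
```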

Output Collection

Post-Execution Pipeline (runtime_actions.py)

handle_outputs(job_directory) runs after the CWL tool completes:

  1. Loads JobProxy from .cwl_job.json
  2. Reads cwl_params.json for job metadata location
  3. Calls job_proxy.collect_outputs() to get CWL output dict
  4. For each output, dispatches by type:
CWL Output Type Galaxy Action
File Copy to dataset path, handle secondaryFiles
Directory Copy tree to extra_files_path
Array of Files Create dataset list collection
Record Create dataset record collection
Scalar/JSON Write to expression.json
Literal (_: prefix) Write inline content to file
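The dispatch in that table can be mirrored by a toy classifier. This is purely illustrative; the real logic lives in runtime_actions.py's handle_outputs(), and the action names here are invented labels.

```python
# Toy classifier mirroring the output-type dispatch table above.
# Action names are hypothetical labels, not Galaxy identifiers.
def classify_output(value):
    if isinstance(value, dict):
        cls = value.get("class")
        if cls == "File":
            return "copy_to_dataset_path"
        if cls == "Directory":
            return "copy_tree_to_extra_files_path"
        # A record of Files -> dataset record collection
        if value and all(isinstance(v, dict) and v.get("class") == "File"
                         for v in value.values()):
            return "create_record_collection"
    if isinstance(value, list) and value and all(
            isinstance(v, dict) and v.get("class") == "File" for v in value):
        return "create_list_collection"
    return "write_expression_json"  # scalar / plain JSON outputs

assert classify_output({"class": "File", "location": "/x"}) == "copy_to_dataset_path"
assert classify_output([{"class": "File", "location": "/x"}]) == "create_list_collection"
assert classify_output({"a": {"class": "File", "location": "/x"}}) == "create_record_collection"
assert classify_output(42) == "write_expression_json"
```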

Secondary Files

Stored in dataset_X_files/__secondary_files__/ with an index:

// __secondary_files_index.json
{ "order": ["file1.idx", "file2.bai"] }

Reconstructed during input conversion by dataset_wrapper_to_file_json().
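That reconstruction can be sketched with plain file operations. The directory and index-file names come from the document; the exact index location (placed inside `__secondary_files__/` here) and the reader function are assumptions.

```python
import json
import os
import tempfile

# Hypothetical reader rebuilding the ordered secondaryFiles list from the
# index layout shown above. Index location inside __secondary_files__/ is
# an assumption.
def load_secondary_files(extra_files_path):
    sf_dir = os.path.join(extra_files_path, "__secondary_files__")
    with open(os.path.join(sf_dir, "__secondary_files_index.json")) as fh:
        order = json.load(fh)["order"]
    return [{"class": "File",
             "location": os.path.join(sf_dir, basename),
             "basename": basename} for basename in order]

# Build a throwaway dataset_X_files layout and read it back.
root = tempfile.mkdtemp()
sf_dir = os.path.join(root, "__secondary_files__")
os.makedirs(sf_dir)
for name in ("file1.idx", "file2.bai"):
    open(os.path.join(sf_dir, name), "w").close()
with open(os.path.join(sf_dir, "__secondary_files_index.json"), "w") as fh:
    json.dump({"order": ["file1.idx", "file2.bai"]}, fh)

secondary = load_secondary_files(root)
assert [f["basename"] for f in secondary] == ["file1.idx", "file2.bai"]
```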

Galaxy Metadata

Written to galaxy.json:

{
  "output_name": {
    "created_from_basename": "result.txt",
    "ext": "data",
    "format": "http://edamontology.org/format_XXXX"
  }
}

Workflow Integration

WorkflowProxy (parser.py:571-759)

Wraps cwltool.workflow.Workflow and converts CWL workflows to Galaxy's internal format.

Key methods:

  • step_proxies() — returns ToolStepProxy or SubworkflowStepProxy per step
  • tool_reference_proxies() — collects all tool definitions recursively (for dynamic registration)
  • input_connections_by_step() — maps CWL step output sources to Galaxy connection format
  • to_dict() — converts entire CWL workflow to Galaxy workflow dict format

Step Proxy Hierarchy

BaseStepProxy
├── ToolStepProxy          # CommandLineTool/ExpressionTool step
│   └── tool_proxy         # ToolProxy for the embedded tool
└── SubworkflowStepProxy   # Nested Workflow step
    └── subworkflow_proxy  # WorkflowProxy for the sub-workflow

InputProxy (parser.py:971-1016)

Represents a workflow step input connection:

  • input_name — CWL input field name
  • cwl_source_id — source reference (step/output)
  • scatter — boolean
  • to_dict() — produces Galaxy input dict with merge_type, scatter_type, value_from

CWL → Galaxy Workflow Conversion

WorkflowProxy.to_dict() produces:

{
    "name": "workflow_name",
    "steps": {
        0: {"type": "data_input", "label": "input1", ...},  # CWL input
        1: {"type": "tool", "tool_uuid": "...", "input_connections": {...}},  # Tool step
        2: {"type": "subworkflow", "subworkflow": {...}},   # Sub-workflow step
    },
    "annotation": "..."
}

CWL workflow inputs map to Galaxy data_input/data_collection_input/parameter_input steps depending on type.
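A toy version of that type-to-step mapping, assuming CWL's two schema shapes (plain type names and `{"type": "array", "items": ...}` dicts); the actual decision logic in WorkflowProxy.to_dict() is more involved.

```python
# Hypothetical sketch of mapping a CWL workflow input type to a Galaxy
# input step type, per the description above.
def step_type_for_input(cwl_type):
    if cwl_type in ("File", "Directory"):
        return "data_input"
    if isinstance(cwl_type, dict) and cwl_type.get("type") == "array":
        if cwl_type.get("items") in ("File", "Directory"):
            return "data_collection_input"
        return "parameter_input"
    return "parameter_input"  # string, int, boolean, etc.

assert step_type_for_input("File") == "data_input"
assert step_type_for_input({"type": "array", "items": "File"}) == "data_collection_input"
assert step_type_for_input("int") == "parameter_input"
```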

Scatter Support

CWL scatter is mapped via InputProxy:

  • scatter_type = "dotproduct" | "flat_crossproduct" (from scatterMethod)
  • Scatter inputs identified by checking step's scatter field

Galaxy Workflow Manager Integration (managers/workflows.py)

When importing a CWL workflow:

  1. Creates workflow_proxy from CWL dict/path
  2. Extracts tool_reference_proxies — all tools used
  3. Registers each as a dynamic tool (with UUID)
  4. Converts to Galaxy dict format via workflow_proxy.to_dict()
  5. Passes to Galaxy's standard workflow import

Test Infrastructure

Test Files

  • lib/galaxy_test/api/test_tools_cwl.py — API tests for CWL tool execution

    • class TestCwlTools(ApiTestCase) — test class
    • Tests for various CWL features (int, file, array, record, etc.)
    • _run_and_get_stdout(), _run() helper methods
  • lib/galaxy_test/api/cwl/ — CWL-specific test utilities

  • test/functional/tools/cwl_tools/ — CWL tool definitions used in tests

    • v1.0_custom/ — custom CWL v1.0 test tools
    • Conformance test data

CWL Populator (lib/galaxy_test/base/populators.py)

CWL_TOOL_DIRECTORY = "test/functional/tools/cwl_tools"

CwlPopulator class:

  • run_cwl_job(tool_id, job, history_id) — submits CWL tool via API
    • New path: calls tool_request_raw() (tool request API)
    • Old path: called run_tool_raw() (legacy /api/tools)
  • Extracts tool_request_id from response
  • Returns CwlToolRun

CwlToolRun class:

  • Wraps tool request response
  • Properties: job_id, output, output_collection
  • Waits for tool request completion before accessing outputs

Conformance Tests

  • scripts/cwl_conformance_to_test_cases.py — converts CWL conformance tests to Galaxy test format
  • scripts/update_cwl_conformance_tests.sh — updates conformance test suite
  • conformance_tests_gen() — generator that loads conformance YAML and yields test cases

The 4 New Commits (Tool Request API Migration)

Commit 1: c290f52d83 — "WIP: migrate CWL tool running to tool request API"

Foundational commit switching test infrastructure from /api/tools to /api/jobs tool request API.

Changes:

  • CwlToolRun.__init__ now takes tool_request_id instead of run_response
  • CwlPopulator.run_cwl_job() calls tool_request_raw() instead of run_tool_raw()
  • Added test_cwl_int_simple test
  • Added cwl_int.cwl to sample_tool_conf.xml

Commit 2: d4d68d2a9b — "Fix persisting CWL tools for tool requests"

Tool request API uses Celery tasks, so CWL tools must serialize/deserialize correctly.

Changes:

  • ToolProxy.__init__ accepts and stores tool_id
  • to_persistent_representation() includes tool_id
  • from_persistent_representation() reads tool_id back
  • QueueJobs schema gains tool_id: str and tool_uuid: Optional[UUID]
  • JobsService populates these from tool metadata
  • Celery task chain passes tool_id/tool_uuid through to create_tool_from_representation()
  • Unit tests verify round-trip serialization preserves UUID and tool_id

Commit 3: d968749217 — "Type error..."

Preparatory cleanup stripping 55 lines of Galaxy type-mapping logic from _outer_field_to_input_instance().

Commit 4: d2f9a20b36 — "WIP: by-pass legacy Galaxy parameter handling for CWL tools"

The architecturally significant commit:

A. has_galaxy_inputs flag (tools/__init__.py):

self.has_galaxy_inputs = False  # Set True only when pages.inputs_defined

For CWL tools, inputs_style="cwl" on PagesSource means has_galaxy_inputs stays False.

B. Bypass parameter machinery (handle_input_async):

if self.has_galaxy_inputs:
    expanded_incomings, job_tool_states, collection_info = expand_meta_parameters_async(...)
    params, errors = self._populate_async(request_context, expanded_incoming)
else:
    # CWL path: pass state through directly
    expanded_incomings = [deepcopy(tool_request_internal_state.input_state)]
    params = expanded_incoming
    errors = {}

C. Thread validated_tool_state to exec_before_job:

# OLD:
input_json = self.param_dict_to_cwl_inputs(param_dict, local_working_directory)

# NEW:
input_json = validated_tool_state.input_state

D. CWL input pages simplified:

  • parse_input_pages() always returns CWL-style pages (no Galaxy interface overlay)
  • InputInstance stripped to name/label/description only

Migration Summary

Aspect Legacy Path New Path
API endpoint /api/tools /api/jobs (tool request API)
Parameter handling CWL→Galaxy widgets→param_dict→CWL Raw JSON passed through
CWL job inputs param_dict_to_cwl_inputs(param_dict) validated_tool_state.input_state
Tool serialization Not needed (same process) Celery: ToolProxy.to_persistent_representation()
Input form rendering Full Galaxy form with types has_galaxy_inputs=False, no form
Expansion/validation expand_meta_parameters_async() + _populate_async() Bypassed entirely

What's NOT Covered

This research document has limitations due to context window constraints:

Partially Covered

  • representation.py line-by-line: Covered the key functions but not every helper or edge case in the 589-line file
  • util.py internals: Covered the main functions (galactic_job_json, output_to_cwl_json) but not every upload target class or collection handling detail
  • parser.py JobProxy: Covered the main methods but not every normalization step or edge case in _normalize_job()

Not Covered

  • Client-side CWL support: Any Vue/TypeScript components for CWL tool forms in client/
  • Galaxy model changes: CWL-related changes to model/ (DynamicTool model, tool state storage)
  • Job wrapper changes: lib/galaxy/jobs/__init__.py — is_cwl_job property and CWL-specific job handling details
  • Command factory: lib/galaxy/jobs/command_factory.py — how CWL job scripts differ from standard Galaxy job scripts (relocate_dynamic_outputs.py generation)
  • CWL v1.1/v1.2 differences: The branch supports v1.0, v1.1, and v1.2 but version-specific handling wasn't analyzed
  • Workflow run.py changes: CWL-specific workflow execution modifications (raw_to_galaxy for File outputs in workflow steps)
  • Galaxy controller changes: webapps/galaxy/controllers/tool_runner.py — CWL tool type allowlisting
  • Conformance test coverage: Specific CWL conformance test pass/fail status, which tests are "red" vs "green"
  • basic.py full diff: The complete set of modifications to basic.py for CWL support (beyond FieldTypeToolParameter)
  • Expression tool JavaScript execution: How CWL ExpressionTool JS evaluation works inside Galaxy
  • CWL format/EDAM mapping: How CWL format URIs map to Galaxy datatypes
  • Error handling: CWL validation error propagation, failed job handling
  • Galaxy config for CWL: strict_cwl_validation, enable_cwl flags, tool_conf setup
  • The galactic_cwl tool type: How gx:Interface extensions work in detail, what map_to does
  • Scatter execution: How CWL scatter actually maps to Galaxy's collection-based parallelism
  • value_from expressions: How CWL valueFrom JavaScript expressions are evaluated during workflow execution

CWL Legacy Runtime Deep Dive

How CWL tools execute inside Galaxy's job infrastructure — from API request to command execution to output collection. Written to inform the migration toward a runtimeify-style approach using the tool request API and validated tool state.

Branch: cwl_on_tool_request_api_2


Table of Contents

  1. Executive Summary
  2. YAML Tool Runtime (The Target Pattern)
  3. CWL Tool Execution: End-to-End
  4. Phase 1: API Entry and State Handling
  5. Phase 2: Job Creation and Persistence
  6. Phase 3: Job Preparation and Evaluation
  7. Phase 4: exec_before_job — The CWL Core
  8. Phase 5: JobProxy — cwltool Bridge
  9. Phase 6: Command Assembly and Job Script
  10. Phase 7: Output Collection
  11. The Representation Layer (Legacy Hack)
  12. Comparison: CWL vs YAML Tool Runtime
  13. What a CWL Runtimeify Would Look Like
  14. Unresolved Questions

Executive Summary

CWL tool execution uses a fundamentally different architecture from YAML tools:

  • YAML tools use UserToolEvaluator with param_dict_style="json". The evaluator calls runtimeify() to convert JobInternalToolState into a JobRuntimeToolState with CWL-style File objects. Command building uses do_eval() (JavaScript/CWL expressions) against these inputs. Everything happens inside the evaluator — no special pre-processing hook needed.

  • CWL tools use the standard ToolEvaluator with param_dict_style="regular". The critical work happens in CwlCommandBindingTool.exec_before_job(), which extracts validated_tool_state.input_state, creates a JobProxy wrapping cwltool, and pre-computes the entire command line, stdin/stdout/stderr, and environment variables. These are stashed in param_dict as __cwl_command and __cwl_command_state. The evaluator just uses those pre-computed values verbatim.

The key architectural difference: YAML tools build commands at evaluation time using expressions. CWL tools delegate command building to cwltool via a proxy object and store the result before evaluation even starts.


YAML Tool Runtime (The Target Pattern)

Understanding this is critical because the goal is to make CWL execution follow a similar pattern.

Evaluator Selection

MinimalJobWrapper._get_tool_evaluator() (jobs/__init__.py:1402-1415):

if self.tool.base_command or self.tool.shell_command:
    klass = UserToolEvaluator   # YAML tools
else:
    klass = ToolEvaluator       # Galaxy tools, CWL tools

CWL tools have neither base_command nor shell_command, so they get ToolEvaluator.

UserToolEvaluator.build_param_dict() (evaluation.py:1130-1170)

Two paths based on whether validated_tool_state exists:

New path (runtimeify):

hda_references, adapt_datasets, adapt_collections = setup_for_runtimeify(
    self.app, compute_environment, input_datasets, input_dataset_collections
)
job_runtime_state = runtimeify(validated_tool_state, self.tool, adapt_datasets, adapt_collections)
cwl_style_inputs = job_runtime_state.input_state

Returns: {"inputs": cwl_style_inputs, "outdir": job_working_directory}

UserToolEvaluator._build_command_line() (evaluation.py:1172-1198)

Evaluates base_command + arguments or shell_command using do_eval() against the CWL-style inputs. No Cheetah templates. No pre-computation.

State Transformation Chain

RequestToolState (API) → decode() → RequestInternalToolState
    → dereference() → RequestInternalDereferencedToolState
    → expand() → JobInternalToolState (persisted to job.tool_state)
    → runtimeify() → JobRuntimeToolState (at evaluation time, File objects with paths)

Each transition is typed, validated, and unit-testable.
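The chain can be walked in miniature with plain dicts: decode() maps encoded API ids to database ids, and runtimeify() swaps {src, id} references for CWL-style File objects. The two lookup tables are stand-ins for Galaxy's real id cipher and object-store path resolution.

```python
# Toy walk through the state chain above; mappings are assumed stand-ins.
ID_CIPHER = {"f2db41e1fa331b3e": 7}            # encoded API id -> db id
DATASET_PATHS = {7: "/galaxy/datasets/7.dat"}  # db id -> on-disk path

def decode(request_state):
    """RequestToolState -> RequestInternalToolState (encoded ids -> db ids)."""
    return {k: ({"src": v["src"], "id": ID_CIPHER[v["id"]]}
                if isinstance(v, dict) and "src" in v else v)
            for k, v in request_state.items()}

def runtimeify(job_internal_state):
    """JobInternalToolState -> JobRuntimeToolState (refs -> File objects)."""
    return {k: ({"class": "File",
                 "location": DATASET_PATHS[v["id"]],
                 "path": DATASET_PATHS[v["id"]]}
                if isinstance(v, dict) and v.get("src") == "hda" else v)
            for k, v in job_internal_state.items()}

request_state = {"input1": {"src": "hda", "id": "f2db41e1fa331b3e"}, "threshold": 3}
runtime_state = runtimeify(decode(request_state))
assert runtime_state["input1"]["class"] == "File"
assert runtime_state["threshold"] == 3
```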


CWL Tool Execution: End-to-End

POST /api/jobs (tool_request_raw)
    │
    ▼
Tool.handle_input_async()
    │ has_galaxy_inputs=False → bypass expand_meta_parameters, _populate_async
    │ creates JobInternalToolState from raw input
    ▼
execute_async() → _execute()
    │ job.tool_state = execution_slice.validated_param_combination.input_state
    ▼
[Job persisted to DB, queued via Celery]
    │ Tool serialized via ToolProxy.to_persistent_representation()
    ▼
JobWrapper.prepare()
    │ _get_tool_evaluator() → ToolEvaluator (not UserToolEvaluator)
    ▼
ToolEvaluator.set_compute_environment()
    │ Reconstructs JobInternalToolState from job.tool_state
    │ Calls build_param_dict() → returns early for CWL (no output wrapping)
    │ Calls execute_tool_hooks() → exec_before_job(validated_tool_state=...)
    ▼
CwlCommandBindingTool.exec_before_job()
    │ input_json = validated_tool_state.input_state  ← direct access
    │ Creates JobProxy(input_json, output_dict, job_dir)
    │ Extracts: command_line, stdin, stdout, stderr, env
    │ Stages files, saves .cwl_job.json
    │ Sets param_dict["__cwl_command"] and ["__cwl_command_state"]
    ▼
ToolEvaluator.build()
    │ _build_command_line() → uses param_dict["__cwl_command"] verbatim
    │ _build_config_files() → returns empty (CWL tools have no config files)
    │ _build_environment_variables() → reads from __cwl_command_state["env"]
    ▼
build_command() (command_factory.py)
    │ Wraps command with container, dependency resolution
    │ CWL-specific: writes cwl_params.json, generates relocate_dynamic_outputs.py
    ▼
Job script executes on compute node
    │ 1. Run __cwl_command
    │ 2. python relocate_dynamic_outputs.py
    ▼
handle_outputs()
    │ Loads JobProxy from .cwl_job.json
    │ Calls job_proxy.collect_outputs()
    │ Moves files to Galaxy dataset paths
    │ Writes galaxy.json metadata

Phase 1: API Entry and State Handling

Tool.expand_incoming_async() (tools/__init__.py:2157-2223)

The has_galaxy_inputs flag controls whether Galaxy's parameter machinery runs:

if self.has_galaxy_inputs:
    # Galaxy tools: full parameter expansion, validation, population
    expanded_incomings, job_tool_states, collection_info = expand_meta_parameters_async(...)
else:
    # CWL tools: pass state through as-is
    expanded_incomings = [deepcopy(tool_request_internal_state.input_state)]
    job_tool_states = [deepcopy(tool_request_internal_state.input_state)]
    collection_info = None

After expansion, validation still happens:

if self.has_galaxy_inputs:
    params, errors = self._populate_async(request_context, expanded_incoming)
else:
    params = expanded_incoming
    errors = {}

But a JobInternalToolState is always created and validated against the tool's parameter model:

internal_tool_state = JobInternalToolState(job_tool_state)
internal_tool_state.validate(self, f"{self.id} (job internal model)")

has_galaxy_inputs Flag (tools/__init__.py:1725,1734)

Set during parse_inputs():

self.has_galaxy_inputs = False           # line 1725
if pages.inputs_defined:
    self.has_galaxy_inputs = True        # line 1734

For CWL tools, CwlPageSource reports inputs_style = "cwl" rather than Galaxy-defined input pages, so with the new commits has_galaxy_inputs stays False.


Phase 2: Job Creation and Persistence

execute.py (lib/galaxy/tools/execute.py)

execute_async()_execute()execute_single_job():

# Line 254-256:
if execution_slice.validated_param_combination:
    tool_state = execution_slice.validated_param_combination.input_state
    job.tool_state = tool_state

This persists the JobInternalToolState.input_state dict as JSON on the Job model. For CWL tools using the new path, this is the raw CWL-compatible input dict (dataset references as {src: "hda", id: <int>}).

Celery Serialization

Tool request API uses Celery tasks. CWL tools must round-trip through serialization:

  • ToolProxy.to_persistent_representation() serializes the full CWL tool description
  • QueueJobs schema carries tool_id and tool_uuid
  • On the worker, create_tool_from_representation() reconstructs the tool
  • This was fixed in commit d4d68d2a9b

Phase 3: Job Preparation and Evaluation

JobWrapper.prepare() (jobs/__init__.py:1247-1314)

Called by the job runner when the job is ready to execute:

tool_evaluator = self._get_tool_evaluator(job)                    # line 1270
tool_evaluator.set_compute_environment(compute_environment, ...)   # line 1272
(self.command_line, self.version_command_line,
 self.extra_filenames, self.environment_variables,
 self.interactivetools) = tool_evaluator.build()                   # line 1274

Evaluator Selection (jobs/__init__.py:1402-1415)

if self.tool.base_command or self.tool.shell_command:
    klass = UserToolEvaluator   # YAML tools have these
else:
    klass = ToolEvaluator       # CWL tools don't

CWL tools always get ToolEvaluator (not UserToolEvaluator).

ToolEvaluator.set_compute_environment() (evaluation.py:166-243)

Reconstructs validated state from the persisted job:

# Lines 217-220:
internal_tool_state = None
if job.tool_state:
    internal_tool_state = JobInternalToolState(job.tool_state)
    internal_tool_state.validate(self.tool, f"{self.tool.id} (job internal model)")

Then calls hooks with the validated state:

self.execute_tool_hooks(inp_data=inp_data, out_data=out_data,
                        incoming=incoming, validated_tool_state=internal_tool_state)

Which calls:

self.tool.exec_before_job(self.app, inp_data, out_data, self.param_dict,
                          validated_tool_state=validated_tool_state)

ToolEvaluator.build_param_dict() — CWL Branch (evaluation.py:263-285)

CWL tools get a plain dict (not TreeDict) and return early:

if self.tool.tool_type == "cwl":
    param_dict: Union[dict[str, Any], TreeDict] = self.param_dict
else:
    param_dict = TreeDict(self.param_dict)

# ... populate wrappers, input dataset wrappers ...

if self.tool.tool_type == "cwl":
    # don't need the outputs or the sanitization
    param_dict["__local_working_directory__"] = self.local_working_directory
    return param_dict

Skips: output dataset wrapping, output collection wrapping, non-job params, sanitization.


Phase 4: exec_before_job — The CWL Core

CwlCommandBindingTool.exec_before_job() (tools/__init__.py:3757-3829)

This is where CWL-specific execution setup happens. Full annotated flow:

def exec_before_job(self, app, inp_data, out_data, param_dict=None,
                    validated_tool_state=None):
    super().exec_before_job(...)
    local_working_directory = param_dict["__local_working_directory__"]

    # 1. GET INPUT STATE — direct from validated_tool_state (new path)
    input_json = validated_tool_state.input_state

    # 2. BUILD OUTPUT DICT — maps output names to dataset paths
    output_dict = {}
    for name, dataset in out_data.items():
        output_dict[name] = {
            "id": str(getattr(dataset.dataset, dataset.dataset.store_by)),
            "path": dataset.get_file_name(),
        }

    # 3. FILTER INPUT JSON — remove unset optional files and empty strings
    input_json = {k: v for k, v in input_json.items()
                  if not (isinstance(v, dict) and v.get("class") == "File"
                          and v.get("location") == "None")}
    input_json = {k: v for k, v in input_json.items() if v != ""}

    # 4. CREATE JOB PROXY — wraps cwltool
    cwl_job_proxy = self._cwl_tool_proxy.job_proxy(
        input_json, output_dict, local_working_directory)

    # 5. EXTRACT EXECUTION DETAILS FROM CWLTOOL
    cwl_command_line = cwl_job_proxy.command_line   # list of args
    cwl_stdin = cwl_job_proxy.stdin
    cwl_stdout = cwl_job_proxy.stdout
    cwl_stderr = cwl_job_proxy.stderr
    env = cwl_job_proxy.environment

    # 6. ASSEMBLE COMMAND STRING
    command_line = " ".join(
        shlex.quote(arg) if needs_shell_quoting_hack(arg) else arg
        for arg in cwl_command_line
    )
    if cwl_stdin:  command_line += f' < "{cwl_stdin}"'
    if cwl_stdout: command_line += f' > "{cwl_stdout}"'
    if cwl_stderr: command_line += f' 2> "{cwl_stderr}"'

    # 7. STAGE FILES — symlinks for input files + InitialWorkDirRequirement
    tool_working_directory = os.path.join(local_working_directory, "working")
    safe_makedirs(tool_working_directory)
    cwl_job_proxy.stage_files()
    cwl_job_proxy.rewrite_inputs_for_staging()

    # 8. PERSIST JOB PROXY — for output collection later
    cwl_job_proxy.save_job()   # writes .cwl_job.json

    # 9. STASH IN PARAM_DICT — for evaluator to pick up
    param_dict["__cwl_command"] = command_line
    param_dict["__cwl_command_state"] = {
        "args": cwl_command_line,
        "stdin": cwl_stdin,
        "stdout": cwl_stdout,
        "stderr": cwl_stderr,
        "env": env,
    }

Critical observation: The input_json at step 1 is now validated_tool_state.input_state (the new path). In the legacy path, this would have been self.param_dict_to_cwl_inputs(param_dict, local_working_directory) which reverse-engineers CWL inputs from Galaxy's wrapped parameter dict via to_cwl_job() or galactic_flavored_to_cwl_job().

$GALAXY_SLOTS Handling

needs_shell_quoting_hack() exempts $GALAXY_SLOTS from quoting. But there's a deeper hack: cwltool needs a concrete number for ResourceRequirement.coresMin at job-construction time. JobProxy._select_resources() substitutes a sentinel value (1.480231396), and the command_line property replaces it back with $GALAXY_SLOTS:

# parser.py:442-449
@property
def command_line(self):
    command_line = self.cwl_job().command_line
    return [fragment.replace(str(SENTINEL_GALAXY_SLOTS_VALUE), "$GALAXY_SLOTS")
            for fragment in command_line]

Phase 5: JobProxy — cwltool Bridge

Constructor (parser.py:329-344)

class JobProxy:
    def __init__(self, tool_proxy, input_dict, output_dict, job_directory):
        self._tool_proxy = tool_proxy
        self._input_dict = input_dict      # CWL job inputs
        self._output_dict = output_dict    # {name: {id, path}}
        self._job_directory = job_directory
        self._final_output = None
        self._ok = True
        self._cwl_job = None
        self._normalize_job()

_normalize_job() (parser.py:376-391)

Prepares input dict for cwltool:

  1. Converts "path" keys to "location" in File/Directory objects
  2. Calls process.fill_in_defaults() to inject CWL defaults
  3. Uses cwltool's visit_class() for recursive path rewriting
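The path-to-location rewrite in step 1 can be sketched as a recursive walk (a simplification; the real _normalize_job() uses cwltool's visit_class() and also fills in CWL defaults):

```python
def pathify_to_location(obj):
    """Recursively rewrite 'path' keys to 'location' in File/Directory
    objects, as _normalize_job() does before handing inputs to cwltool.
    A minimal sketch, not the actual cwltool-based implementation."""
    if isinstance(obj, dict):
        if obj.get("class") in ("File", "Directory") and "path" in obj and "location" not in obj:
            obj["location"] = obj.pop("path")
        for value in obj.values():
            pathify_to_location(value)
    elif isinstance(obj, list):
        for value in obj:
            pathify_to_location(value)
    return obj
```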

_ensure_cwl_job_initialized() (parser.py:354-374)

Lazily creates the cwltool Job object:

job_args = dict(
    basedir=self._job_directory,
    select_resources=self._select_resources,
    outdir=os.path.join(self._job_directory, "working"),
    tmpdir=os.path.join(self._job_directory, "cwltmp"),
    stagedir=os.path.join(self._job_directory, "cwlstagedir"),
    use_container=False,            # Galaxy handles containers
    beta_relaxed_fmt_check=True,
)
runtimeContext = RuntimeContext(job_args)

# Defensive copy to prevent mutations
cwl_tool_instance = copy.copy(self._tool_proxy._tool)
cwl_tool_instance.inputs_record_schema = copy.deepcopy(
    cwl_tool_instance.inputs_record_schema)

self._cwl_job = next(cwl_tool_instance.job(
    self._input_dict, self._output_callback, runtimeContext))
self._is_command_line_job = hasattr(self._cwl_job, "command_line")

Key: use_container=False — Galaxy's own containerization (Docker/Singularity) wraps the command later in build_command(). cwltool must not try to run containers.

Directory Layout

{job_directory}/
├── .cwl_job.json           # Serialized JobProxy (tool + inputs + outputs)
├── cwl_params.json         # {job_metadata, job_id_tag} for output collection
├── cwlstagedir/            # cwltool staging area (symlinks)
├── cwltmp/                 # cwltool temp directory
├── working/                # Tool output directory (outdir)
├── outputs/
│   └── dataset_{id}_files/ # Galaxy extra files per output
│       └── __secondary_files__/  # Secondary files
├── relocate_dynamic_outputs.py   # Generated output collection script
└── tool_script.sh          # Galaxy job script

stage_files() (parser.py:541-564)

Uses cwltool's PathMapper to create symlinks:

if hasattr(cwl_job, "pathmapper"):
    process.stage_files(cwl_job.pathmapper, stageFunc, ignore_writable=True)

if hasattr(cwl_job, "generatefiles"):
    # InitialWorkDirRequirement
    generate_mapper = pathmapper.PathMapper(
        cwl_job.generatefiles["listing"], outdir, outdir, separateDirs=False)
    process.stage_files(generate_mapper, stageFunc, ignore_writable=inplace_update)
    relink_initialworkdir(generate_mapper, outdir, outdir, inplace_update=inplace_update)

save_job() (parser.py:508-516)

Writes .cwl_job.json:

job_objects = {
    "tool_representation": self._tool_proxy.to_persistent_representation(),
    "job_inputs": self._input_dict,
    "output_dict": self._output_dict,
}
json.dump(job_objects, open(job_file, "w"))

This is how the post-execution output collection script can reconstruct the full CWL context.

CommandLineTool vs ExpressionTool

| Property | CommandLineTool | ExpressionTool |
|---|---|---|
| is_command_line_job | True | False |
| command_line | cwl_job.command_line (list of args) | ["true"] (no-op) |
| stdin/stdout/stderr | From cwl_job | None |
| environment | From cwl_job (EnvVarRequirement) | {} |
| stage_files() | Uses pathmapper + generatefiles | No pathmapper |
| collect_outputs() | cwl_job.collect_outputs(workdir, rcode) | cwl_job.run() → JS execution → _output_callback |

Phase 6: Command Assembly and Job Script

ToolEvaluator.build() (evaluation.py)

After exec_before_job has set __cwl_command in param_dict:

_build_command_line() (line 806-809):

if self.tool.tool_type in CWL_TOOL_TYPES and "__cwl_command" in param_dict:
    command_line = param_dict["__cwl_command"]  # Pre-computed, no Cheetah

_build_config_files() (line 849-851):

if self.tool.tool_type in CWL_TOOL_TYPES:
    return config_filenames  # Empty — CWL tools have no config files

_build_environment_variables() (line 873-907):

# Extract CWL env vars from __cwl_command_state
for key, value in param_dict.get("__cwl_command_state", {}).get("env", {}).items():
    environment_variable = dict(name=key, template=value)
    environment_variables_raw.append(environment_variable)

# Later: CWL tools skip Cheetah templating for env vars
if self.tool.tool_type not in CWL_TOOL_TYPES:
    template_type = "cheetah"

build_command() (command_factory.py:39-293)

Assembles the final job script. CWL-specific block at lines 141-158:

if job_wrapper.is_cwl_job:
    # 1. Write cwl_params.json for output collection
    cwl_metadata_params = {
        "job_metadata": join("working", job_wrapper.tool.provided_metadata_file),
        "job_id_tag": job_wrapper.get_id_tag(),
    }
    with open(cwl_metadata_params_path, "w") as f:
        json.dump(cwl_metadata_params, f)

    # 2. Generate relocate script
    relocate_contents = (
        "from galaxy_ext.cwl.handle_outputs import relocate_dynamic_outputs; "
        "relocate_dynamic_outputs()"
    )
    write_script(relocate_script_file, relocate_contents, ...)

    # 3. Append to job script
    commands_builder.append_command(SETUP_GALAXY_FOR_METADATA)
    commands_builder.append_command(f"python '{relocate_script_file}'")

Also, at lines 289-293, CWL jobs skip the duplicate SETUP_GALAXY_FOR_METADATA before the metadata command since it's already added above.

Resulting Job Script Structure

# 1. Dependency setup (conda, etc.)
# 2. Container setup if needed
# 3. The CWL command itself (__cwl_command)
<cwl_tool_command> < stdin > stdout 2> stderr
# 4. Exit code capture
# 5. Galaxy environment setup
SETUP_GALAXY_FOR_METADATA
# 6. Output relocation
python 'relocate_dynamic_outputs.py'
# 7. Standard metadata commands

Phase 7: Output Collection

Entry Point: relocate_dynamic_outputs.py

Generated by command_factory.py, calls:

from galaxy_ext.cwl.handle_outputs import relocate_dynamic_outputs
relocate_dynamic_outputs()

handle_outputs.py → runtime_actions.py

galaxy_ext/cwl/handle_outputs.py is a thin wrapper that adjusts sys.path and calls galaxy.tool_util.cwl.runtime_actions.handle_outputs().

handle_outputs() (runtime_actions.py:69-229)

Step 1: Load context

job_proxy = load_job_proxy(job_directory, strict_cwl_validation=False)
cwl_metadata_params = json.load(open(cwl_metadata_params_path))
exit_code_file = default_exit_code_file(".", cwl_metadata_params["job_id_tag"])
tool_exit_code = read_exit_code_from(exit_code_file, job_id_tag)

load_job_proxy() (parser.py:798-808) reconstructs the full CWL context:

job_objects = json.load(open(os.path.join(job_directory, ".cwl_job.json")))
cwl_tool = tool_proxy_from_persistent_representation(job_objects["tool_representation"])
return cwl_tool.job_proxy(job_objects["job_inputs"], job_objects["output_dict"], job_directory)

Step 2: Collect CWL outputs

outputs = job_proxy.collect_outputs(tool_working_directory, tool_exit_code)

For CommandLineTools: delegates to cwltool's collect_outputs() which evaluates output glob patterns. For ExpressionTools: calls cwl_job.run() to execute the JavaScript expression.

Step 3: Process each output

| CWL Output Type | Processing |
|---|---|
| File (dict with location) | move_output() — copies file to Galaxy dataset path, handles secondary files |
| Directory (dict with location) | move_directory() — copies tree to extra_files_path |
| Record (dict without location) | Splits by \|__part__\| prefix, processes each field |
| List (array) | Creates indexed elements with filenames |
| Scalar/JSON | handle_known_output_json() — writes to expression.json |
| None/missing | Fills with null JSON for declared-but-absent outputs |
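The dispatch above can be sketched as a classifier over the raw output values (a simplification; the real runtime_actions.py logic inspects more structure, such as the record-field prefixes):

```python
def classify_cwl_output(value):
    """Classify a collected CWL output value by the same shape rules the
    output-processing table describes. Sketch only, not Galaxy's code."""
    if value is None:
        return "missing"          # filled with null JSON
    if isinstance(value, list):
        return "list"             # indexed elements
    if isinstance(value, dict):
        if "location" in value:
            return "Directory" if value.get("class") == "Directory" else "File"
        return "record"           # dict without location
    return "json"                 # scalar written to expression.json
```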

Step 4: Write galaxy.json

job_metadata = os.path.join(job_directory, cwl_metadata_params["job_metadata"])
with open(job_metadata, "w") as f:
    json.dump(provided_metadata, f)

This galaxy.json contains per-output metadata: created_from_basename, ext, format, and for collections, elements.

Secondary Files

Stored in dataset_{id}_files/__secondary_files__/ with an index file:

{"order": ["file.idx", "file.bai"]}

The move_output() function handles secondary file naming. CWL uses a ^ prefix convention (each ^ removes one extension from the primary file name), but the code also supports STORE_SECONDARY_FILES_WITH_BASENAME mode.
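The ^ convention comes from the CWL spec's secondaryFiles pattern rules; a minimal sketch of that rule (not Galaxy's move_output() implementation):

```python
import os

def secondary_file_name(primary_basename, pattern):
    """Apply CWL's secondaryFiles suffix convention: each leading '^'
    strips one extension from the primary file name before the remaining
    suffix is appended. Sketch of the spec-defined rule."""
    while pattern.startswith("^"):
        primary_basename, _ext = os.path.splitext(primary_basename)
        pattern = pattern[1:]
    return primary_basename + pattern
```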


The Representation Layer (Legacy Hack)

This section documents what the migration aims to eliminate. The legacy path converts Galaxy param_dict back to CWL inputs.

to_cwl_job() (representation.py:386-488)

Called by CwlTool.param_dict_to_cwl_inputs(). Walks tool.inputs (Galaxy's parsed parameter tree):

  • Repeat inputs → CWL arrays (strips _repeat suffix)
  • Conditional inputs → reads _cwl__type_ discriminator, extracts _cwl__value_
  • Data inputs → dataset_wrapper_to_file_json() creates CWL File objects
  • Collections → collection_wrapper_to_array() or collection_wrapper_to_record()
  • Primitives → type-coerced values
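The conditional case can be illustrated with a small sketch of the discriminator unpacking (hypothetical helper; the real to_cwl_job() walks the whole parameter tree):

```python
def unpack_cwl_conditional(state):
    """Unpack the synthetic conditional encoding Galaxy uses for CWL union
    types: _cwl__type_ is the selected branch, _cwl__value_ the payload.
    Sketch based on the keys described above, not the actual code."""
    selected_type = state["_cwl__type_"]
    if selected_type == "null":
        return None
    return state.get("_cwl__value_")
```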

galactic_flavored_to_cwl_job() (representation.py:286-383)

Simpler variant for GalacticCwlTool. Uses map_to paths for nested structures. No repeat/conditional handling.

dataset_wrapper_to_file_json() (representation.py:155-195)

Converts a Galaxy DatasetWrapper to CWL File object:

raw_file_object = {
    "class": "File",
    "location": path,
    "size": int(dataset_wrapper.get_size()),
    "format": str(dataset_wrapper.cwl_formats[0]),
    "basename": basename,
    "nameroot": nameroot,
    "nameext": nameext,
    "secondaryFiles": [...]
}

Handles secondary files by symlinking into an _inputs directory.
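The basename/nameroot/nameext fields follow the CWL spec's File-object definitions and can be derived mechanically (a sketch; the real dataset_wrapper_to_file_json() additionally fills size, format, and secondaryFiles from the wrapper):

```python
import os

def cwl_name_triple(path):
    """Derive the basename/nameroot/nameext fields of a CWL File object:
    basename is the final path component, nameroot/nameext split it at the
    last extension."""
    basename = os.path.basename(path)
    nameroot, nameext = os.path.splitext(basename)
    return basename, nameroot, nameext
```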

Why This Is a Problem

The round-trip (CWL schema → Galaxy widgets → user input → Galaxy param_dict → CWL job JSON) requires:

  • Every CWL type mapped to a Galaxy widget type (TYPE_REPRESENTATIONS)
  • Union types become Galaxy conditionals with _cwl__type_/_cwl__value_ keys
  • FieldTypeToolParameter in basic.py as a catch-all for CWL's flexible typing
  • DatasetWrappers must be reverse-engineered back to CWL File objects
  • All of this touches basic.py, which is core Galaxy infrastructure

With validated tool state, the input JSON goes directly to exec_before_job without this reverse-engineering.


Comparison: CWL vs YAML Tool Runtime

| Aspect | CWL Tool (current) | YAML Tool (runtimeify) |
|---|---|---|
| Evaluator class | ToolEvaluator | UserToolEvaluator |
| param_dict_style | "regular" | "json" |
| Input state source | validated_tool_state.input_state (new) or param_dict_to_cwl_inputs() (legacy) | runtimeify(validated_tool_state) |
| Dataset → File conversion | Done in exec_before_job (input_json already has references) OR dataset_wrapper_to_file_json() (legacy) | Done by setup_for_runtimeify() adapters |
| Command building | Pre-computed by cwltool via JobProxy, stored in __cwl_command | Built at eval time via do_eval() with CWL expressions |
| Where command lives | param_dict["__cwl_command"] | Returned from _build_command_line() |
| Output collection | Post-execution relocate_dynamic_outputs.py script via cwltool | Standard Galaxy metadata |
| Job proxy needed | Yes — wraps cwltool.job.Job | No |
| Container handling | Galaxy wraps (cwltool use_container=False) | Galaxy wraps |
| File staging | cwltool PathMapper (symlinks) | Galaxy's standard input staging |
| Config files | None | YamlTemplateConfigFile |
| Environment vars | From cwltool (EnvVarRequirement) | From tool definition |

Key Structural Differences

  1. Command pre-computation: CWL delegates to cwltool at exec_before_job time. YAML tools evaluate at _build_command_line time. This is unavoidable — cwltool is the authoritative CWL command builder.

  2. Two-phase output: CWL uses a post-execution script to collect outputs because cwltool needs to run its own output glob evaluation. YAML tools use Galaxy's standard metadata.

  3. File staging: CWL uses cwltool's PathMapper. YAML tools use Galaxy's standard input path rewriting via compute_environment.input_path_rewrite().

  4. No runtimeify() equivalent: CWL currently gets validated_tool_state.input_state directly. It does NOT go through runtimeify() to convert dataset references to File objects with paths. The input_state either already has the right format (new path) or gets reverse-engineered from param_dict (legacy path).


What a CWL Runtimeify Would Look Like

The goal: make CWL execution use typed state transitions similar to YAML tools.

Current New Path (partially done)

validated_tool_state.input_state  (has dataset refs as {src: "hda", id: N})
    ↓
exec_before_job() filters + passes directly to JobProxy
    ↓
JobProxy._normalize_job() fills CWL defaults
    ↓
cwltool processes and generates command

What's Missing for Full Runtimeify

  1. Dataset reference → File object conversion: Currently exec_before_job receives input_state with raw references. Someone needs to convert {src: "hda", id: N} to {"class": "File", "location": "/path/to/file", ...}. In the YAML path, runtimeify() + setup_for_runtimeify() does this. For CWL, this conversion could happen:

    • Option A: Inside exec_before_job (current approach — it has access to inp_data)
    • Option B: Via a CWL-specific runtimeify() before exec_before_job
    • Option C: Use UserToolEvaluator for CWL tools too (would need base_command or shell_command set)
  2. Secondary files: YAML's runtimeify() doesn't handle secondary files. CWL needs them. dataset_wrapper_to_file_json() currently handles this in the legacy path.

  3. Directory inputs: CWL directories are tar archives in Galaxy. Need extraction logic that the YAML path doesn't have.

  4. Collection mapping: CWL arrays/records map to Galaxy collections. The YAML runtimeify() has adapt_collection but raises NotImplementedError for some cases.
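Item 1 above (reference-to-File conversion) can be sketched as a CWL-flavored runtimeify step. Everything here is hypothetical: the function name, and path_for_id as an assumed id-to-path lookup standing in for Galaxy's dataset path resolution.

```python
import os

def runtimeify_inputs(input_state, path_for_id):
    """Hypothetical sketch: walk validated input state and replace dataset
    references ({"src": "hda", "id": N}) with CWL File objects carrying
    real paths. Secondary files, formats, and checksums are omitted."""
    if isinstance(input_state, dict):
        if input_state.get("src") == "hda":
            path = path_for_id(input_state["id"])
            basename = os.path.basename(path)
            nameroot, nameext = os.path.splitext(basename)
            return {"class": "File", "location": path, "basename": basename,
                    "nameroot": nameroot, "nameext": nameext}
        return {k: runtimeify_inputs(v, path_for_id) for k, v in input_state.items()}
    if isinstance(input_state, list):
        return [runtimeify_inputs(v, path_for_id) for v in input_state]
    return input_state
```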

The Input State Question

In the current new path, what does validated_tool_state.input_state look like for CWL tools? It appears to be the raw API input — dataset references but not yet File objects with paths. The conversion to CWL File objects (with location, size, checksum, secondary files) would need to happen somewhere before JobProxy gets the input dict.

The YAML tool path does this in setup_for_runtimeify() → adapt_dataset(), which creates DataInternalJson objects (CWL File-like). A CWL equivalent would need to be richer — adding secondary files, CWL format URIs, checksums, etc.


Unresolved Questions

  1. Where should dataset→File conversion happen for CWL? In exec_before_job (has inp_data dict), in a CWL-specific runtimeify, or somewhere else? The current code in exec_before_job just uses validated_tool_state.input_state directly — does this already contain resolved paths or just references?

  2. Can CWL tools use UserToolEvaluator? They'd need base_command or shell_command. Could we set a synthetic shell_command that's the cwltool-generated command? Probably not — the command isn't known until JobProxy runs.

  3. How close can file staging get to Galaxy's standard path? CWL uses cwltool's PathMapper for symlinks. YAML uses compute_environment.input_path_rewrite(). Could we skip PathMapper and use Galaxy's rewriting? Probably not for InitialWorkDirRequirement files.

  4. Can output collection move inside Galaxy? Currently it's a post-execution script. Could collect_outputs() run inside Galaxy's job finishing instead of as a script appended to the job? This would avoid needing to serialize the full tool representation to .cwl_job.json.

  5. ExpressionTool execution: These run JS, not shell commands. The current path returns ["true"] as the command and runs the expression during collect_outputs. How does this interact with the tool request API? Is there a simpler path?

  6. What validated_tool_state.input_state actually contains for CWL right now? Need to trace a concrete test case to see the actual JSON structure at each phase. The filtering in exec_before_job (removing location == "None" files and empty strings) suggests the state may not be fully clean yet.

  7. Secondary files in the new path: The legacy dataset_wrapper_to_file_json() reconstructs secondary files from __secondary_files__ directories. In the new path using validated_tool_state.input_state, who provides secondary file information?

CWL Tool Loading and Reference Test Infrastructure

Research document covering how Galaxy loads CWL tools from .cwl files into executable Tool objects, how the CWL reference/conformance test infrastructure works, and how loaded CWL tools interact with the tool request API.

Branch: cwl_on_tool_request_api_2


Table of Contents

  1. Topic 1: How Galaxy Loads CWL Tools
  2. Topic 2: CWL Reference Test Infrastructure
  3. Topic 3: Tool Loading and the Tool Request API

Topic 1: How Galaxy Loads CWL Tools

Overview

The CWL tool loading pipeline transforms a .cwl file into a fully usable Galaxy Tool object through a multi-layered chain:

.cwl file
  -> get_tool_source() (factory.py:105-111)
    -> CwlToolSource (parser/cwl.py:72)
      -> tool_proxy() (cwl/parser.py:761)
        -> SchemaLoader.tool() (cwl/schema.py:94)
          -> cwltool loads & validates CWL document
        -> _cwl_tool_object_to_proxy() (cwl/parser.py:858)
          -> CommandLineToolProxy or ExpressionToolProxy
    -> CwlToolSource.parse_tool_type() returns "cwl" or "galactic_cwl"
  -> create_tool_from_source() (__init__.py:450)
    -> tool_types["cwl"] -> CwlTool class
    -> Tool.__init__() calls tool.parse(tool_source)
      -> CwlCommandBindingTool.parse() stores _cwl_tool_proxy
      -> Tool.parse_inputs() -> input_models_for_pages() -> CWL parameter models

Step 1: File Detection and ToolSource Creation

Entry point: get_tool_source() in lib/galaxy/tool_util/parser/factory.py:64-114

When Galaxy encounters a .cwl (or .json) file, it is identified as a CWL tool:

# factory.py:105-111
elif config_file.endswith(".json") or config_file.endswith(".cwl"):
    uuid = uuid or uuid4()
    return CwlToolSource(config_file, strict_cwl_validation=strict_cwl_validation,
                         tool_id=tool_id, uuid=uuid)

For CWL tools to be recognized by the directory loader, enable_beta_formats must be True:

  • lib/galaxy/tool_util/loader_directory.py:119-150 - looks_like_a_tool() only checks CWL files when enable_beta_formats=True
  • lib/galaxy/tool_util/loader_directory.py:253-277 - _find_tool_files() only searches non-XML files when enable_beta_formats=True
  • lib/galaxy/config/__init__.py:960 - Galaxy config: enable_beta_tool_formats (default: False)
  • Test driver sets enable_beta_tool_formats=True at lib/galaxy_test/driver/driver_util.py:212

Step 2: CwlToolSource and the ToolProxy

File: lib/galaxy/tool_util/parser/cwl.py:72-346

CwlToolSource extends ToolSource (the abstract interface all tool formats implement). It lazily creates a ToolProxy on first access:

# cwl.py:93-116
@property
def tool_proxy(self) -> "ToolProxy":
    if self._tool_proxy is None:
        if self._source_path is not None:
            self._tool_proxy = tool_proxy(
                self._source_path,
                strict_cwl_validation=self._strict_cwl_validation,
                tool_directory=self._tool_directory,
                tool_id=self._tool_id,
                uuid=self._uuid,
            )
        else:
            # From persistent representation (Celery deserialization)
            self._tool_proxy = tool_proxy_from_persistent_representation(
                self._source_object, ...)
    return self._tool_proxy

Key parse methods on CwlToolSource:

| Method | Returns | Notes |
|---|---|---|
| parse_tool_type() (line 125) | "cwl" or "galactic_cwl" | Checks for gx:Interface hint |
| parse_command() (line 137) | "$__cwl_command" | Placeholder; real command built by cwltool at exec time |
| parse_input_pages() (line 223) | PagesSource([CwlPageSource], inputs_style="cwl") | Creates CWL-style input page |
| parse_outputs() (line 228) | (outputs, output_collections) | Delegates to ToolProxy.output_instances() |
| parse_requirements() (line 305) | containers, software reqs | Extracts DockerRequirement, SoftwareRequirement, etc. |
| parse_profile() (line 322) | "17.09" | Hardcoded CWL profile |
| to_string() (line 344) | JSON string | For Celery serialization; calls tool_proxy.to_persistent_representation() |

The gx:Interface hint determines tool type:

  • With gx:Interface (e.g., galactic_cat.cwl): tool_type = "galactic_cwl" -> GalacticCwlTool class
  • Without gx:Interface: tool_type = "cwl" -> CwlTool class

Step 3: Schema Loading Pipeline

File: lib/galaxy/tool_util/cwl/schema.py:1-111

The SchemaLoader class wraps cwltool's document loading pipeline:

# schema.py:32-110
class SchemaLoader:
    def __init__(self, strict=True, validate=True):
        self._strict = strict
        self._validate = validate

    def loading_context(self):
        # Creates cwltool LoadingContext with:
        loading_context.strict = self._strict
        loading_context.do_validate = self._validate
        loading_context.enable_dev = True    # allows dev CWL versions
        loading_context.do_update = True
        loading_context.relax_path_checks = True

    def raw_process_reference(self, path):
        # Step 1: Normalize path, create file:// URI
        # Step 2: load_tool.fetch_document(uri, loadingContext)
        # Returns: RawProcessReference(loading_context, process_object, uri)

    def process_definition(self, raw_process_reference):
        # Step 3: resolve_and_validate_document() - full CWL validation
        # Returns: ResolvedProcessDefinition

    def tool(self, **kwds):
        # Step 4: load_tool.make_tool() - creates cwltool Process object
        # Returns: cwltool Process (CommandLineTool or ExpressionTool)

Two singleton instances:

  • schema_loader = SchemaLoader() - strict, validating (line 109)
  • non_strict_non_validating_schema_loader = SchemaLoader(strict=False, validate=False) (line 110)

Step 4: ToolProxy Construction

File: lib/galaxy/tool_util/cwl/parser.py:127-256, 761-879

The tool_proxy() function (line 761) calls _to_cwl_tool_object() (line 811):

# parser.py:811-855
def _to_cwl_tool_object(tool_path=None, tool_object=None, ...):
    schema_loader = _schema_loader(strict_cwl_validation)

    if tool_path is not None:
        # Load from file path
        raw_process_reference = schema_loader.raw_process_reference(tool_path)
        cwl_tool = schema_loader.tool(raw_process_reference=raw_process_reference)
    elif tool_object is not None:
        # Load from dict/YAML object (for persistent representations)
        tool_object = yaml_no_ts().load(json.dumps(tool_object))
        raw_process_reference = schema_loader.raw_process_reference_for_object(tool_object)
        cwl_tool = schema_loader.tool(raw_process_reference=raw_process_reference)

    _hack_cwl_requirements(cwl_tool)  # Galaxy-specific requirement adjustments
    check_requirements(raw_tool)       # Validate supported requirements

    return _cwl_tool_object_to_proxy(cwl_tool, tool_id, uuid, ...)

_cwl_tool_object_to_proxy() (line 858) selects the proxy class based on class:

# parser.py:858-879
def _cwl_tool_object_to_proxy(cwl_tool, tool_id, uuid, ...):
    process_class = raw_tool["class"]
    if process_class == "CommandLineTool":
        proxy_class = CommandLineToolProxy
    elif process_class == "ExpressionTool":
        proxy_class = ExpressionToolProxy
    else:
        raise Exception("File not a CWL CommandLineTool.")
    return proxy_class(cwl_tool, tool_id, uuid, raw_process_reference, tool_path)

ToolProxy base class (line 127-256) provides:

  • job_proxy(input_dict, output_dict, job_directory) (line 150) - creates a JobProxy for execution
  • galaxy_id() (line 162) - derives Galaxy tool ID from CWL id field or UUID
  • to_persistent_representation() (line 199) - serializes for Celery/database storage
  • from_persistent_representation() (line 215) - deserializes
  • requirements / hints_or_requirements_of_class() - CWL requirement access

Constructor (line 130-148) strips format from input fields to prevent cwltool validation errors:

for input_field in self._tool.inputs_record_schema["fields"]:
    if "format" in input_field:
        del input_field["format"]

CommandLineToolProxy (line 258-322) adds:

  • input_fields() (line 278) - reads inputs_record_schema["fields"], resolves schemaDefs
  • input_instances() (line 305) - converts fields to InputInstance objects
  • output_instances() (line 308) - reads outputs_record_schema["fields"]
  • docker_identifier() (line 315) - extracts DockerRequirement

ExpressionToolProxy (line 325) - subclass of CommandLineToolProxy, only changes _class = "ExpressionTool".

Step 5: Input Parameters - From CWL Schema to Galaxy Parameter Models

CWL inputs flow through two parallel systems:

A. Galaxy Legacy Parameters (parse_inputs in __init__.py)

Tool.parse_inputs() at lib/galaxy/tools/__init__.py:1718-1757:

def parse_inputs(self, tool_source):
    self.has_galaxy_inputs = False
    pages = tool_source.parse_input_pages()
    # CwlToolSource returns PagesSource with inputs_style="cwl"
    # PagesSource.inputs_defined returns True (style != "none")
    try:
        parameters = input_models_for_pages(pages, self.profile)
        self.parameters = parameters
    except Exception:
        pass
    if pages.inputs_defined:
        self.has_galaxy_inputs = True  # <-- WAS True for CWL
        # BUT the new branch bypasses this for CWL

Key change on this branch: has_galaxy_inputs is set True because inputs_style="cwl" is not "none". However, the expand_incoming_async() method at line 2183-2191 checks self.has_galaxy_inputs to decide whether to run Galaxy's parameter expansion machinery. When has_galaxy_inputs=False (forced for CWL in the new path), the raw state passes through.

B. CWL Parameter Models (New typed system)

CwlPageSource (parser/cwl.py:366) creates CwlInputSource objects from the tool proxy's input_instances().

These flow into input_models_for_pages() at lib/galaxy/tool_util/parameters/factory.py:453:

def from_input_source(input_source, profile):
    if input_source.input_class == "cwl":   # CwlInputSource.input_class returns "cwl"
        tool_parameter = _from_input_source_cwl(input_source)
    else:
        tool_parameter = _from_input_source_galaxy(input_source, profile)

_from_input_source_cwl() (factory.py:421-436) maps CWL schema-salad types to parameter models:

| CWL Type | Galaxy Parameter Model | parameter_type |
|---|---|---|
| int | CwlIntegerParameterModel | "cwl_integer" |
| float | CwlFloatParameterModel | "cwl_float" |
| string | CwlStringParameterModel | "cwl_string" |
| boolean | CwlBooleanParameterModel | "cwl_boolean" |
| null | CwlNullParameterModel | "cwl_null" |
| org.w3id.cwl.cwl.File | CwlFileParameterModel | "cwl_file" |
| org.w3id.cwl.cwl.Directory | CwlDirectoryParameterModel | "cwl_directory" |
| [type1, type2, ...] (union) | CwlUnionParameterModel | "cwl_union" |

These models live in lib/galaxy/tool_util_models/parameters.py:1943-2100.

CwlFileParameterModel and CwlDirectoryParameterModel (lines 2061-2088) both use DataRequest as their py_type, meaning the API expects {src: "hda", id: <encoded_id>} for dataset inputs.
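The simple-type branch of this mapping can be sketched as a lookup table (parameter_type strings taken from the table above; the real _from_input_source_cwl() constructs the Pydantic model classes rather than returning strings):

```python
# Sketch: CWL schema-salad type name -> parameter_type string.
CWL_TYPE_TO_PARAMETER_TYPE = {
    "int": "cwl_integer",
    "float": "cwl_float",
    "string": "cwl_string",
    "boolean": "cwl_boolean",
    "null": "cwl_null",
    "org.w3id.cwl.cwl.File": "cwl_file",
    "org.w3id.cwl.cwl.Directory": "cwl_directory",
}

def parameter_type_for(cwl_type):
    """Map a CWL type declaration to its parameter_type; a list of types
    is a union."""
    if isinstance(cwl_type, list):
        return "cwl_union"
    return CWL_TYPE_TO_PARAMETER_TYPE[cwl_type]
```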

Step 6: Tool Class Instantiation

File: lib/galaxy/tools/__init__.py:450-472, 5085-5103

# Line 460-466
elif tool_type := tool_source.parse_tool_type():
    ToolClass = tool_types.get(tool_type)
    if ToolClass is None:
        if tool_type == "cwl":
            raise ToolLoadError("Runtime support for CWL tools is not implemented currently")

# Line 5085-5103 - TOOL_CLASSES list includes:
#   CwlTool,            # tool_type = "cwl"
#   GalacticCwlTool,    # tool_type = "galactic_cwl"
tool_types = {tool_class.tool_type: tool_class for tool_class in TOOL_CLASSES}

Note: The error at line 463-464 fires only if CwlTool is not in TOOL_CLASSES (i.e., on mainline Galaxy without CWL support). On this CWL branch, CwlTool IS in the list.

CWL Tool Class Hierarchy

Tool  (lib/galaxy/tools/__init__.py)
  └── CwlCommandBindingTool  (line 3754)
        ├── GalacticCwlTool   (line 3843, tool_type="galactic_cwl")
        └── CwlTool           (line 3855, tool_type="cwl")

CwlCommandBindingTool (line 3754-3840):

  • exec_before_job() - Creates JobProxy, pre-computes command via cwltool, stages files
  • parse() (line 3831) - Stores _cwl_tool_proxy from tool_source.tool_proxy
  • param_dict_to_cwl_inputs() - Abstract, raises NotImplementedError

CwlTool (line 3855-3873):

  • tool_type = "cwl"
  • may_use_container_entry_point = True
  • param_dict_to_cwl_inputs() - Legacy path via to_cwl_job() (not used in new path)
  • inputs_from_dict() (line 3866) - Translates API payloads between galaxy and cwl representations

GalacticCwlTool (line 3843-3852):

  • tool_type = "galactic_cwl"
  • param_dict_to_cwl_inputs() - Uses galactic_flavored_to_cwl_job() (legacy)

Serialization for Celery

CWL tools serialize/deserialize for Celery task processing:

  1. Serialize: Tool.to_raw_tool_source() (__init__.py:1799) calls CwlToolSource.to_string() (parser/cwl.py:344), which calls ToolProxy.to_persistent_representation() (cwl/parser.py:199). Returns JSON containing class, raw_process_reference (the raw CWL doc), tool_id, and uuid.

  2. Deserialize: create_tool_from_representation() (__init__.py:475) calls get_tool_source(tool_source_class="CwlToolSource", raw_tool_source=json_string), which calls build_cwl_tool_source() (factory.py:48), which calls tool_proxy_from_persistent_representation().

  3. The tool_source_class is persisted as "CwlToolSource" (it's type(self.tool_source).__name__).

Supported CWL Requirements

From lib/galaxy/tool_util/cwl/parser.py:82-96:

SUPPORTED_TOOL_REQUIREMENTS = [
    "CreateFileRequirement",
    "DockerRequirement",
    "EnvVarRequirement",
    "InitialWorkDirRequirement",
    "InlineJavascriptRequirement",
    "LoadListingRequirement",
    "ResourceRequirement",
    "ShellCommandRequirement",
    "ScatterFeatureRequirement",
    "SchemaDefRequirement",
    "SubworkflowFeatureRequirement",
    "StepInputExpressionRequirement",
    "MultipleInputFeatureRequirement",
    "CredentialsRequirement",
]
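
The whitelist above suggests a simple gate at parse time: reject any CWL document whose requirements fall outside the supported set. The sketch below is illustrative only (`check_requirements` is a hypothetical helper, not the actual code in lib/galaxy/tool_util/cwl/parser.py), but it shows the membership check and the need to recurse into nested structures such as workflow steps:

```python
# Hypothetical sketch of gating on the requirements whitelist.
# SUPPORTED_TOOL_REQUIREMENTS mirrors the list above; check_requirements
# is an illustration, not the parser's actual implementation.
SUPPORTED_TOOL_REQUIREMENTS = [
    "CreateFileRequirement",
    "DockerRequirement",
    "EnvVarRequirement",
    "InitialWorkDirRequirement",
    "InlineJavascriptRequirement",
    "LoadListingRequirement",
    "ResourceRequirement",
    "ShellCommandRequirement",
    "ScatterFeatureRequirement",
    "SchemaDefRequirement",
    "SubworkflowFeatureRequirement",
    "StepInputExpressionRequirement",
    "MultipleInputFeatureRequirement",
    "CredentialsRequirement",
]

def check_requirements(rec):
    """Recursively reject any requirement class not in the whitelist."""
    if isinstance(rec, dict):
        for req in rec.get("requirements", []):
            if req["class"] not in SUPPORTED_TOOL_REQUIREMENTS:
                raise ValueError(f"Unsupported requirement {req['class']}")
        for value in rec.values():
            check_requirements(value)
    elif isinstance(rec, list):
        for item in rec:
            check_requirements(item)
```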

Topic 2: CWL Reference Test Infrastructure

Conformance Test Provisioning: update_cwl_conformance_tests.sh

The CWL conformance test tools are neither vendored nor pulled in as a git submodule: they are downloaded on demand by scripts/update_cwl_conformance_tests.sh and never committed. Provisioning is a two-stage process:

Stage 1: Shell Script Downloads Tools

File: scripts/update_cwl_conformance_tests.sh

For each CWL version (1.0, 1.1, 1.2):

  1. Downloads the official CWL spec repo as a zip from GitHub:

    • v1.0: common-workflow-language/common-workflow-language repo
    • v1.1: common-workflow-language/cwl-v1.1 repo
    • v1.2: common-workflow-language/cwl-v1.2 repo
  2. Extracts into test/functional/tools/cwl_tools/v{version}/:

    • conformance_tests.yaml — the test manifest (different source paths per version: v1.0 uses v1.0/conformance_test_v1.0.yaml, others use root conformance_tests.yaml)
    • The test tools directory — v1.0 copies v1.0/v1.0/ (creating the cwl_tools/v1.0/v1.0/ path that sample_tool_conf.xml references), others copy tests/
  3. Runs scripts/cwl_conformance_to_test_cases.py to generate Python test files

Result directory structure after running:

test/functional/tools/cwl_tools/
├── v1.0/
│   ├── conformance_tests.yaml
│   └── v1.0/                    # actual test tools (cat1-testcli.cwl, bwa-mem-tool.cwl, etc.)
├── v1.0_custom/                 # committed Galaxy-specific CWL test tools
├── v1.1/
│   ├── conformance_tests.yaml
│   └── tests/                   # CWL v1.1 test tools
└── v1.2/
    ├── conformance_tests.yaml
    └── tests/                   # CWL v1.2 test tools

Stage 2: Python Script Generates Test Cases

File: scripts/cwl_conformance_to_test_cases.py

  1. Reads conformance_tests.yaml recursively (following $import references via its own conformance_tests_gen())
  2. For each conformance test entry, generates a pytest method in a TestCwlConformance class:
    @pytest.mark.cwl_conformance
    @pytest.mark.cwl_conformance_v1_0
    @pytest.mark.command_line_tool  # from CWL test tags
    @pytest.mark.green             # or @pytest.mark.red
    def test_conformance_v1_0_cat1(self):
        """Test doc string..."""
        self.cwl_populator.run_conformance_test("v1.0", "Test doc string...")
  3. Tests are marked red (known-failing in Galaxy) or green based on a hardcoded RED_TESTS dict:
    • v1.0: ~30 red tests (mostly scatter/valuefrom/subworkflow/secondary files)
    • v1.1: ~50 red tests (adds timelimit, networkaccess, inplace_update, etc.)
    • v1.2: ~100+ red tests (adds conditionals, v1.2-specific features)
  4. Writes generated test file to lib/galaxy_test/api/cwl/test_cwl_conformance_v{version_simple}.py
  5. The generated test class extends BaseCwlWorkflowsApiTestCase and each method calls self.cwl_populator.run_conformance_test(version, doc) — which looks up the test by doc string in conformance_tests.yaml, stages inputs, runs the tool/workflow, and compares outputs

The generated test files ARE committed; the downloaded tool files are NOT.

Conformance Test Lookup at Runtime

CwlPopulator.run_conformance_test(version, doc) (populators.py:3150):

  1. Calls get_conformance_test(version, doc) which iterates conformance_tests.yaml entries matching by doc field
  2. Each entry has tool (relative .cwl path), job (input JSON), and output (expected output) fields
  3. Resolves tool path relative to the conformance test directory
  4. Stages inputs via stage_inputs() (uploads files referenced in the job JSON)
  5. Runs via _run_cwl_tool_job() (POST /api/jobs) or _run_cwl_workflow_job()
  6. Compares outputs using cwltest.compare.compare()
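
The doc-based matching in step 1 amounts to a linear scan over the parsed entries. A minimal sketch (hypothetical `find_conformance_test` helper; the real get_conformance_test iterates entries from conformance_tests.yaml):

```python
# Minimal sketch of the doc-field lookup used by get_conformance_test.
def find_conformance_test(entries, doc):
    """Return the first entry whose doc field matches the given doc string."""
    for entry in entries:
        if entry.get("doc", "").strip() == doc.strip():
            return entry
    raise KeyError(f"No conformance test with doc {doc!r}")
```

Because the lookup key is the doc string, any edit to a test's doc in the upstream spec repo silently orphans the corresponding generated test method.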

Test Tool Locations

Location Committed? Purpose
test/functional/tools/parameters/cwl_*.cwl Yes CWL parameter type testing (10 files)
test/functional/tools/cwl_tools/v1.0_custom/ Yes Galaxy-specific CWL test tools (11 files)
test/functional/tools/cwl_tools/v1.0/v1.0/ No — downloaded CWL v1.0 conformance tools
test/functional/tools/cwl_tools/v1.1/tests/ No — downloaded CWL v1.1 conformance tools
test/functional/tools/cwl_tools/v1.2/tests/ No — downloaded CWL v1.2 conformance tools
test/functional/tools/galactic_cat.cwl Yes Galactic (gx:Interface) CWL tool
test/functional/tools/galactic_record_input.cwl Yes Galactic CWL with record inputs
lib/galaxy_test/api/cwl/test_cwl_conformance_v*.py Yes — generated Generated pytest conformance test cases

Unit tests in test/unit/tool_util/test_cwl.py reference paths like v1.0/v1.0/cat1-testcli.cwl — these require update_cwl_conformance_tests.sh to have been run first.

Tool Configuration for Tests

File: test/functional/tools/sample_tool_conf.xml

All test tools are registered in this file. The CWL section (lines 268-287):

<!-- CWL Testing -->
<tool file="parameters/cwl_int.cwl" />
<tool file="cwl_tools/v1.0/v1.0/cat3-tool.cwl" />
<tool file="cwl_tools/v1.0/v1.0/env-tool1.cwl" />
<tool file="cwl_tools/v1.0/v1.0/null-expression1-tool.cwl" />
<tool file="cwl_tools/v1.0/v1.0/null-expression2-tool.cwl" />
<tool file="cwl_tools/v1.0/v1.0/optional-output.cwl" />
<tool file="cwl_tools/v1.0/v1.0/parseInt-tool.cwl" />
<tool file="cwl_tools/v1.0/v1.0/record-output.cwl" />
<tool file="cwl_tools/v1.0/v1.0/sorttool.cwl" />
<tool file="cwl_tools/v1.0_custom/any1.cwl" />
<tool file="cwl_tools/v1.0_custom/cat1-tool.cwl" />
<tool file="cwl_tools/v1.0_custom/cat2-tool.cwl" />
<tool file="cwl_tools/v1.0_custom/cat-default.cwl" />
<tool file="cwl_tools/v1.0_custom/default_path_custom_1.cwl" />
<tool file="cwl_tools/v1.0_custom/index1.cwl" />
<tool file="cwl_tools/v1.0_custom/optional-output2.cwl" />
<tool file="cwl_tools/v1.0_custom/showindex1.cwl" />
<tool file="galactic_cat.cwl" />
<tool file="galactic_record_input.cwl" />

Note: Several entries reference cwl_tools/v1.0/v1.0/*.cwl, which are not committed and therefore do not exist until update_cwl_conformance_tests.sh has been run; until then those tools fail to load. Only parameters/cwl_int.cwl, the v1.0_custom/ tools, and the root-level galactic tools are always present.

Test Framework Configuration

File: lib/galaxy_test/driver/driver_util.py

Key constants:

  • FRAMEWORK_TOOLS_DIR = os.path.join(GALAXY_TEST_DIRECTORY, "functional", "tools") (line 60)
  • FRAMEWORK_SAMPLE_TOOLS_CONF = os.path.join(FRAMEWORK_TOOLS_DIR, "sample_tool_conf.xml") (line 62)
  • enable_beta_tool_formats=True (line 212) - required for .cwl file loading

Tool conf is resolved at line 177:

tool_conf = os.environ.get("GALAXY_TEST_TOOL_CONF", default_tool_conf)

CWL Parameter Specification Tests

File: test/unit/tool_util/parameter_specification.yml (lines 3946-4196)

Defines validation test cases for CWL parameter types. These test the CwlParameterModel pydantic models:

cwl_int:
  request_valid:
    - parameter: 5
  request_invalid:
    - parameter: "5"   # must be strict int
    - {}               # required
    - parameter: null

cwl_file:
  request_valid:
   - parameter: {src: hda, id: abcdabcd}
  request_invalid:
   - parameter: {src: hda, id: 7}        # id must be encoded
   - parameter: {src: hdca, id: abcdabcd} # hdca not valid for File
   - parameter: null

These are tested by test/unit/tool_util/test_parameter_specification.py.

API Test Infrastructure

File: lib/galaxy_test/api/test_tools_cwl.py

TestCwlTools class runs CWL tools via Galaxy's API. Three execution paths:

  1. Galaxy representation (_run method, line 374): Uses run_tool_payload() which posts to /api/tools with Galaxy-format inputs ({src: "hda", id: ...})

  2. CWL representation (line 54-64): Same endpoint but with inputs_representation="cwl", sending native CWL inputs

  3. CWL job files (via CwlPopulator.run_cwl_job(), line 67-73): Uses tool request API (POST /api/jobs) with CWL job JSON

CwlPopulator

File: lib/galaxy_test/base/populators.py:3019-3178

Key constant:

CWL_TOOL_DIRECTORY = os.path.join(galaxy_root_path, "test", "functional", "tools", "cwl_tools")
# => test/functional/tools/cwl_tools

Methods:

  • run_cwl_job(artifact, job_path, ...) (line 3084): Main entry point. Determines if artifact is tool or workflow, stages inputs via stage_inputs(), then dispatches to _run_cwl_tool_job() or _run_cwl_workflow_job().

  • _run_cwl_tool_job(tool_id, job, history_id) (line 3030): Posts to tool request API via tool_request_raw(). If tool doesn't exist in Galaxy, creates it as a dynamic tool via create_tool_from_path().

  • run_conformance_test(version, doc) (line 3150): Loads conformance test spec, runs the CWL job, and compares outputs using cwltest.compare.compare().

  • get_conformance_test(version, doc) (line 3024): Looks up a test by its doc field from conformance_tests.yaml in the test directory.

Conformance Test Discovery

File: lib/galaxy_test/base/populators.py:320-331

def conformance_tests_gen(directory, filename="conformance_tests.yaml"):
    conformance_tests_path = os.path.join(directory, filename)
    with open(conformance_tests_path) as f:
        conformance_tests = yaml.safe_load(f)
    for conformance_test in conformance_tests:
        if "$import" in conformance_test:
            import_dir, import_filename = os.path.split(conformance_test["$import"])
            yield from conformance_tests_gen(os.path.join(directory, import_dir), import_filename)
        else:
            conformance_test["directory"] = directory
            yield conformance_test

This expects conformance_tests.yaml in each CWL version directory (e.g., test/functional/tools/cwl_tools/v1.0/conformance_tests.yaml). Each test entry has tool, job, output, and doc fields.
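
To see the $import recursion in action without a Galaxy checkout, the same generator can be exercised against a temp directory. The demo below substitutes json.load for yaml.safe_load (so it runs without PyYAML); everything else mirrors the function above:

```python
import json
import os
import tempfile

# Self-contained demo of the $import-following recursion above, with JSON
# standing in for YAML so it runs without PyYAML or a Galaxy checkout.
def conformance_tests_gen(directory, filename="conformance_tests.json"):
    with open(os.path.join(directory, filename)) as f:
        conformance_tests = json.load(f)
    for conformance_test in conformance_tests:
        if "$import" in conformance_test:
            import_dir, import_filename = os.path.split(conformance_test["$import"])
            yield from conformance_tests_gen(os.path.join(directory, import_dir), import_filename)
        else:
            conformance_test["directory"] = directory
            yield conformance_test

# Build a manifest with one direct entry and one $import into a subdirectory.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "extra"))
with open(os.path.join(root, "conformance_tests.json"), "w") as f:
    json.dump([{"doc": "cat1", "tool": "cat1.cwl"}, {"$import": "extra/more.json"}], f)
with open(os.path.join(root, "extra", "more.json"), "w") as f:
    json.dump([{"doc": "sort", "tool": "sort.cwl"}], f)

tests = list(conformance_tests_gen(root))
```

Note how each yielded entry is tagged with the directory it was loaded from, so relative tool/job paths in imported manifests resolve against the importing file's location.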

Test Categories

Test Type Location Runs Against Requires v1.0/v1.0?
Unit: ToolProxy creation test/unit/tool_util/test_cwl.py cwltool directly Yes
Unit: Parameter validation test/unit/tool_util/test_parameter_specification.py Pydantic models No (uses parameters/ tools)
Unit: Runtime model test/unit/tool_util/test_parameter_cwl_runtime_model.py Galaxy parameter models No (uses Galaxy tools)
API: Tool execution lib/galaxy_test/api/test_tools_cwl.py Running Galaxy server Yes (mostly), some use v1.0_custom
Conformance: CWL spec Via CwlPopulator.run_conformance_test() Running Galaxy server Yes

CWL Test Tool Examples

ExpressionTool (parameters/cwl_int.cwl):

class: ExpressionTool
requirements:
  - class: InlineJavascriptRequirement
cwlVersion: v1.2
inputs:
  parameter:
    type: int
outputs:
  output: int
expression: "$({'output': inputs.parameter})"

CommandLineTool (v1.0_custom/cat1-tool.cwl):

class: CommandLineTool
cwlVersion: v1.0
inputs:
  file1:
    type: File
    inputBinding: {position: 1}
  numbering:
    type: boolean?
    inputBinding: {position: 0, prefix: -n}
baseCommand: cat
outputs: {}

Galactic CWL Tool (galactic_cat.cwl) - with gx:Interface:

class: CommandLineTool
$namespaces:
  gx: "http://galaxyproject.org/cwl#"
hints:
  gx:interface:
    gx:inputs:
      - gx:name: input1
        gx:type: data
        gx:format: 'txt'

Topic 3: Tool Loading and the Tool Request API

How CWL Tools Enter the Tool Request API

CWL tools now use the tool request API (POST /api/jobs) instead of the legacy POST /api/tools path. The flow:

POST /api/jobs (CwlPopulator._run_cwl_tool_job)
  -> lib/galaxy/webapps/galaxy/services/jobs.py
    -> creates ToolRequest model
    -> dispatches Celery task: queue_jobs
      -> JobCreationManager.queue_jobs() (lib/galaxy/managers/jobs.py:2174)
        -> dereference() - converts URIs to HDAs
        -> tool.handle_input_async() - creates Job

Dereference Step

File: lib/galaxy/managers/jobs.py:2129-2172

Before handle_input_async(), the dereferencer converts raw data requests to internal HDA references:

tool_state = RequestInternalToolState(tool_request.request)
return dereference(tool_state, tool, dereference_callback, dereference_collection_callback), new_hdas

For CWL tools, CwlFileParameterModel and CwlDirectoryParameterModel have py_type = DataRequest (which expects {src: "hda", id: <encoded_id>}). The dereference step converts URI-based requests to internal HDA IDs.

handle_input_async for CWL

File: lib/galaxy/tools/__init__.py:2377

After dereference, queue_jobs() calls:

tool.handle_input_async(
    request_context,
    tool_request,
    tool_state,       # RequestInternalDereferencedToolState
    history=target_history,
    use_cached_job=use_cached_jobs,
    rerun_remap_job_id=rerun_remap_job_id,
)

Inside handle_input_async, expand_incoming_async() is called:

# __init__.py:2183-2191
if self.has_galaxy_inputs:
    expanded_incomings, job_tool_states, collection_info = expand_meta_parameters_async(...)
else:
    # CWL tools: pass state through as-is
    expanded_incomings = [deepcopy(tool_request_internal_state.input_state)]
    job_tool_states = [deepcopy(tool_request_internal_state.input_state)]
    collection_info = None

Since CWL tools bypass Galaxy's parameter expansion, the input state passes through unchanged. A JobInternalToolState is created and validated against the tool's CWL parameter models:

internal_tool_state = JobInternalToolState(job_tool_state)
internal_tool_state.validate(self, f"{self.id} (job internal model)")

Job Persistence

File: lib/galaxy/tools/execute.py:254-256

if execution_slice.validated_param_combination:
    tool_state = execution_slice.validated_param_combination.input_state
    job.tool_state = tool_state

The JobInternalToolState.input_state dict is persisted as JSON on the Job model. For CWL tools, this contains the raw CWL-compatible inputs with dataset references as {src: "hda", id: <int>}.

Celery Serialization of CWL Tools

The tool request API dispatches jobs via Celery. The tool itself must be serializable:

File: lib/galaxy/tools/execute.py:326-345

raw_tool_source, tool_source_class = tool.to_raw_tool_source()
# For CWL: tool_source_class = "CwlToolSource"
# raw_tool_source = JSON string of ToolProxy.to_persistent_representation()

On the Celery worker:

# lib/galaxy/celery/tasks.py:83-92
def queue_jobs(tool_id, raw_tool_source, tool_source_class, ...):
    tool = create_tool_from_representation(
        app=app, raw_tool_source=raw_tool_source,
        tool_source_class=tool_source_class  # "CwlToolSource"
    )

This reconstructs the full CWL tool from its persistent representation. Fixed in commit d4d68d2a9b.
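
The shape of that round trip can be sketched with stdlib json alone. `ToolProxyStub` below is a stand-in for the real ToolProxy; the field names mirror the persistent representation described above, but the class itself is illustrative:

```python
import json

# Stdlib sketch of the Celery serialization round trip. ToolProxyStub is a
# stand-in for the real ToolProxy; field names mirror the description above.
class ToolProxyStub:
    def __init__(self, raw_process_reference, tool_id, uuid):
        self.raw_process_reference = raw_process_reference
        self.tool_id = tool_id
        self.uuid = uuid

    def to_persistent_representation(self):
        # Serialized on the web process, shipped to the Celery worker.
        return json.dumps({
            "class": "CommandLineToolProxy",
            "raw_process_reference": self.raw_process_reference,
            "tool_id": self.tool_id,
            "uuid": self.uuid,
        })

    @classmethod
    def from_persistent_representation(cls, raw):
        # Reconstructed on the worker from the JSON string.
        data = json.loads(raw)
        return cls(data["raw_process_reference"], data["tool_id"], data["uuid"])

proxy = ToolProxyStub({"class": "CommandLineTool", "baseCommand": "cat"}, "cat1", "abcd-1234")
restored = ToolProxyStub.from_persistent_representation(proxy.to_persistent_representation())
```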

Job Preparation and Evaluation

When the job is ready to execute:

  1. Evaluator selection (jobs/__init__.py:1402-1415):

    if self.tool.base_command or self.tool.shell_command:
        klass = UserToolEvaluator   # YAML tools
    else:
        klass = ToolEvaluator       # CWL tools get this
  2. State reconstruction (evaluation.py:217-220):

    if job.tool_state:
        internal_tool_state = JobInternalToolState(job.tool_state)
        internal_tool_state.validate(self.tool, ...)
  3. param_dict construction (evaluation.py:263-276):

    if self.tool.tool_type == "cwl":
        param_dict = self.param_dict  # plain dict, not TreeDict
        # ...
        # Skip output wrapping, sanitization
        param_dict["__local_working_directory__"] = self.local_working_directory
        return param_dict
  4. Hook execution - calls exec_before_job(validated_tool_state=internal_tool_state)

  5. exec_before_job (__init__.py:3757-3829): Takes validated_tool_state.input_state, creates JobProxy, pre-computes command via cwltool, stores in param_dict["__cwl_command"].

The Input State Gap

Currently there is a structural gap in the new path: validated_tool_state.input_state at exec_before_job time still contains dataset references ({src: "hda", id: N}) rather than CWL File objects with paths. The JobProxy._normalize_job() expects File objects with path or location keys.

This conversion (dataset reference -> CWL File object with filesystem path) is the key missing piece. In the YAML tool path, runtimeify() + setup_for_runtimeify() handles this. For CWL, it needs to happen somewhere between state reconstruction and JobProxy creation, enriched with CWL-specific data (secondaryFiles, format URIs, etc.).
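
The leaf-level transformation itself is small. The sketch below shows the core of it under simplifying assumptions (a hypothetical `path_for_hda` resolver, no secondaryFiles or format enrichment); in practice this logic belongs inside runtimeify()'s adapt_dataset callback, as the plan below lays out:

```python
import os

# Sketch of the missing conversion: a {src: "hda", id: N} reference becomes
# a CWL File object, given a resolver from HDA id to filesystem path.
# path_for_hda and the field set are illustrative.
def dataset_ref_to_cwl_file(ref, path_for_hda):
    path = path_for_hda(ref["id"])
    basename = os.path.basename(path)
    nameroot, nameext = os.path.splitext(basename)
    return {
        "class": "File",
        "path": path,
        "location": path,
        "basename": basename,
        "nameroot": nameroot,
        "nameext": nameext,
    }

file_obj = dataset_ref_to_cwl_file(
    {"src": "hda", "id": 42},
    lambda hda_id: f"/galaxy/datasets/000/dataset_{hda_id}.dat",
)
```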

Dynamic Tool Loading (Test Infrastructure)

When a CWL tool is not pre-loaded in the toolbox, tests create it dynamically:

# populators.py:3040-3050
if os.path.exists(tool_id):
    tool_versions = self.dataset_populator._get("tools", data=dict(tool_id=raw_tool_id)).json()
    if tool_versions:
        galaxy_tool_id = raw_tool_id
    else:
        dynamic_tool = self.dataset_populator.create_tool_from_path(tool_id)
        galaxy_tool_id = None
        tool_uuid = dynamic_tool["uuid"]

create_tool_from_path() (line 1057) posts to Galaxy's dynamic tool creation API with src="from_path". This uses lib/galaxy/managers/tools.py which requires enable_beta_tool_formats config.

Test API Paths

Test Method API Endpoint Input Format Notes
_run() in test_tools_cwl.py POST /api/tools Galaxy ({src: "hda", id: ...}) or CWL Legacy path
CwlPopulator._run_cwl_tool_job() POST /api/jobs CWL-native New tool request API
CwlPopulator.run_cwl_job() Routes to above CWL job JSON file Stages inputs first

Summary of Loading -> Execution Path

1. Tool Loading (startup or dynamic):
   .cwl file -> CwlToolSource -> ToolProxy -> CwlTool/GalacticCwlTool

2. API Request:
   POST /api/jobs {tool_id, inputs: {param: {src: "hda", id: ...}}}

3. Request Processing:
   -> ToolRequest created -> Celery task dispatched
   -> Tool deserialized from CwlToolSource persistent representation
   -> dereference() resolves data references to HDAs

4. Job Creation:
   -> expand_incoming_async() bypasses Galaxy parameter expansion (has_galaxy_inputs=False)
   -> JobInternalToolState validated against CWL parameter models
   -> Job persisted with tool_state = input_state dict

5. Job Execution:
   -> ToolEvaluator (not UserToolEvaluator)
   -> JobInternalToolState reconstructed from job.tool_state
   -> exec_before_job():
      -> input_json = validated_tool_state.input_state
      -> [GAP: needs dataset ref -> File object conversion]
      -> JobProxy(input_json, output_dict, job_dir)
      -> cwltool generates command, stages files
      -> param_dict["__cwl_command"] = command_line
   -> build() uses __cwl_command verbatim

6. Output Collection:
   -> relocate_dynamic_outputs.py (appended to job script)
   -> Reconstructs JobProxy from .cwl_job.json
   -> cwltool's collect_outputs() evaluates output globs

Unresolved Questions

  1. Only parameters/cwl_int.cwl is in sample_tool_conf.xml from the parameters directory — should other CWL parameter tools (cwl_float.cwl, cwl_string.cwl, cwl_file.cwl, etc.) be added?

  2. The has_galaxy_inputs flag for CWL is True because inputs_style="cwl" satisfies inputs_defined. How is this being overridden to False in the new path? Is there a separate mechanism?

  3. How are CWL array and record input types handled by the new parameter model system? _from_input_source_cwl() only handles simple types and unions — no array/record support yet.

  4. CwlUnionParameterModel has request_requires_value = False (with TODO comment) — is this correct for all unions, or only unions containing null?

CWL Validated Runtime Plan

How to plumb CWL tool execution from persisted JobInternalToolState through to cwltool command extraction, using the YAML tool runtimeify infrastructure.

Branch: cwl_on_tool_request_api_2


Entry Assumptions

  • Request entered via Tool Request API (POST /api/jobs)
  • Job object has job.tool_state containing persisted JobInternalToolState (dataset refs as {src: "hda", id: N})
  • CWL ToolParameter objects available matching the state schema
  • has_galaxy_inputs = False -- Galaxy's legacy parameter machinery bypassed
  • ToolEvaluator selected (not UserToolEvaluator) because CWL tools lack base_command/shell_command

Step 1: Reconstruct JobInternalToolState from job.tool_state

Where: ToolEvaluator.set_compute_environment() at evaluation.py:217-220

Current code (already works):

internal_tool_state = None
if job.tool_state:
    internal_tool_state = JobInternalToolState(job.tool_state)
    internal_tool_state.validate(self.tool, f"{self.tool.id} (job internal model)")

job.tool_state is the JSON dict persisted at job creation (execute.py:254-256). JobInternalToolState wraps it and validates against the tool's CWL parameter model. The validated state flows into execute_tool_hooks() at line 222.

No changes needed here.


Step 2: Runtimeify -- Convert Dataset References to CWL File Objects

This is the core new work. Currently exec_before_job receives validated_tool_state.input_state with raw {src: "hda", id: N} references and passes them directly to JobProxy. But JobProxy._normalize_job() (parser.py:376-391) expects File objects with path/location keys. That's a structural mismatch.

2a: Call runtimeify before exec_before_job

Where: ToolEvaluator.set_compute_environment(), in the param_dict_style == "regular" branch at evaluation.py:200-222.

Change: After reconstructing internal_tool_state and before calling execute_tool_hooks, insert a runtimeify call:

internal_tool_state = None
if job.tool_state:
    internal_tool_state = JobInternalToolState(job.tool_state)
    internal_tool_state.validate(self.tool, f"{self.tool.id} (job internal model)")

# NEW: runtimeify for CWL tools
if internal_tool_state is not None and self.tool.tool_type in CWL_TOOL_TYPES:
    from galaxy.tool_util.parameters.convert import runtimeify
    from galaxy.tools.runtime import setup_for_runtimeify

    hda_references, adapt_dataset, adapt_collection = setup_for_runtimeify(
        self.app, compute_environment, inp_data, input_dataset_collections
    )
    job_runtime_state = runtimeify(
        internal_tool_state, self.tool, adapt_dataset, adapt_collection
    )
    # Replace internal_tool_state with runtime state for exec_before_job
    internal_tool_state = job_runtime_state

Why here and not inside exec_before_job: setup_for_runtimeify needs inp_data (the {name: HDA} dict) and compute_environment, both available at this scope. This mirrors where UserToolEvaluator calls runtimeify in its build_param_dict() (evaluation.py:1130-1170). Keeping runtimeify in the evaluator means exec_before_job receives already-resolved File objects -- cleaner separation.

What runtimeify returns: A JobRuntimeToolState whose input_state dict has DataInternalJson File objects instead of {src: "hda", id: N} references. Each File object contains class, path, basename, nameroot, nameext, format, size, location, listing.

The state type passed to exec_before_job changes: from JobInternalToolState to JobRuntimeToolState. The execute_tool_hooks signature accepts Optional[JobInternalToolState] -- we either widen it to accept both types, or pass the runtime state's input_state dict through a wrapper. The simplest approach: change the parameter type to Optional[Union[JobInternalToolState, JobRuntimeToolState]] and update exec_before_job to accept either. Both expose .input_state with the same interface.
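
The widening works because the hook only touches the shared .input_state attribute. A minimal sketch with stub classes (illustrative, not the real model definitions):

```python
from typing import Union

# Sketch of the proposed signature widening: both state classes expose
# .input_state, so hooks can accept either. These stubs are illustrative.
class JobInternalToolState:
    def __init__(self, input_state):
        self.input_state = input_state

class JobRuntimeToolState:
    def __init__(self, input_state):
        self.input_state = input_state

ValidatedState = Union[JobInternalToolState, JobRuntimeToolState]

def exec_before_job(validated_tool_state: ValidatedState):
    # Only the shared .input_state interface is used, so either type works.
    return validated_tool_state.input_state

internal = exec_before_job(JobInternalToolState({"input_file": {"src": "hda", "id": 42}}))
runtime = exec_before_job(JobRuntimeToolState({"input_file": {"class": "File", "path": "/tmp/x"}}))
```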

2b: CWL-specific adapt_dataset -- secondary files, format URIs

The existing adapt_dataset in runtime.py:77-94 produces a basic DataInternalJson with class, path, basename, format (Galaxy extension), size, location, listing. This is sufficient for YAML tools. CWL needs more:

  1. Secondary files -- stored at {hda.extra_files_path}/__secondary_files__/
  2. CWL format URIs -- http://edamontology.org/{edam_format} (available via hda.cwl_formats, model __init__.py:5212-5213)
  3. Checksum -- optional, cwltool doesn't require it for command generation

Option A (preferred): CWL-specific adapt_dataset callback.

Create a new function in runtime.py (or a new cwl_runtime.py module) that wraps the base adapt_dataset:

# lib/galaxy/tools/cwl_runtime.py (new file)

def setup_for_cwl_runtimeify(app, compute_environment, input_datasets, input_dataset_collections=None):
    """CWL-enriched version of setup_for_runtimeify.

    Returns (hda_references, adapt_dataset, adapt_collection) where
    adapt_dataset produces CwlDataInternalJson with secondaryFiles.
    """
    hda_references, base_adapt_dataset, adapt_collection = setup_for_runtimeify(
        app, compute_environment, input_datasets, input_dataset_collections
    )

    hdas_by_id = {d.id: d for d in input_datasets.values() if d is not None}

    def adapt_dataset(value):
        base_result = base_adapt_dataset(value)
        hda = hdas_by_id.get(value.id)
        if hda is None:
            return base_result

        result_dict = base_result.model_dump(by_alias=True)

        # Enrich with secondary files
        secondary_files = discover_secondary_files(hda, compute_environment)
        if secondary_files:
            result_dict["secondaryFiles"] = secondary_files

        # Enrich with CWL format URI (replace Galaxy extension with EDAM URI)
        if hasattr(hda, 'cwl_formats') and hda.cwl_formats:
            result_dict["format"] = str(hda.cwl_formats[0])

        return CwlDataInternalJson(**result_dict)

    return hda_references, adapt_dataset, adapt_collection

Option B (alternative): Enrich after runtimeify.

Call base runtimeify(), then walk the result and enrich File objects with secondary files. Downside: requires a second traversal. Option A is cleaner.

2c: discover_secondary_files function

New function in cwl_runtime.py:

def discover_secondary_files(hda, compute_environment=None):
    """Discover secondary files for an HDA from its extra_files_path.

    Secondary files are stored at {extra_files_path}/__secondary_files__/
    with an ordering index at __secondary_files_index.json.

    Returns list of dicts: [{"class": "File"|"Directory", "path": "...", "basename": "..."}]
    """
    extra_files_path = hda.extra_files_path
    secondary_files_dir = os.path.join(extra_files_path, SECONDARY_FILES_EXTRA_PREFIX)

    if not os.path.exists(secondary_files_dir):
        return []

    secondary_files = []
    for name in os.listdir(secondary_files_dir):
        sf_path = os.path.join(secondary_files_dir, name)
        real_path = os.path.realpath(sf_path)
        is_dir = os.path.isdir(real_path)

        entry = {
            "class": "Directory" if is_dir else "File",
            "path": compute_environment.input_path_rewrite(sf_path) if compute_environment else sf_path,
            "basename": name,
        }
        secondary_files.append(entry)

    return secondary_files

This mirrors the logic in representation.py:163-183 (dataset_wrapper_to_file_json) but works from HDA objects rather than DatasetWrappers.

Important: The legacy code in representation.py:168-183 symlinks the primary file and secondary files into a shared _inputs directory so basename-based references work. The same staging is still required, but stage_files() in JobProxy already provides it via cwltool's PathMapper, so we do not need to replicate the symlinking here. The paths in secondaryFiles should be the real filesystem paths; cwltool will stage them.

2d: CwlDataInternalJson model

Where: lib/galaxy/tool_util_models/parameters.py, extend from DataInternalJson

class CwlSecondaryFileJson(StrictModel):
    class_: Annotated[Literal["File", "Directory"], Field(alias="class")]
    path: str
    basename: str

class CwlDataInternalJson(DataInternalJson):
    """DataInternalJson extended with CWL-specific fields."""
    secondaryFiles: Optional[List[CwlSecondaryFileJson]] = None

Alternative: Add secondaryFiles directly to DataInternalJson as an optional field (it's already commented out at line 610). This avoids a subclass but pollutes the base model with CWL concerns. Prefer the subclass.

Validation concern: runtimeify() validates the output as JobRuntimeToolState (convert.py:576). The validation calls validate(input_models) which checks against parameter model schemas. CwlDataInternalJson must be accepted wherever DataInternalJson is. Since CwlDataInternalJson extends DataInternalJson, and secondaryFiles is optional, existing validators should accept it. The model_dump(by_alias=True) output will include secondaryFiles only when present.

2e: How runtimeify walks CWL parameters

runtimeify() (convert.py:539-577) uses visit_input_values() to walk the tool's parameter model. For CWL tools, the parameter model consists of CwlInputParameter objects (or similar). The visitor identifies DataParameterModel instances and calls adapt_dict() on their values.

Critical question: Do CWL tool parameters implement DataParameterModel? The CWL parameter model (CwlInputParameter) must either be or extend DataParameterModel for file-type inputs, or runtimeify won't recognize them. If CWL parameters use a different type hierarchy, we need either:

  1. Extend the to_runtime_callback to recognize CWL file parameters, or
  2. Ensure CWL file parameters are modeled as DataParameterModel

This needs verification by tracing the actual parameter model instances for a CWL tool.


Step 3: How Runtimeified State Flows into exec_before_job and JobProxy

3a: exec_before_job receives runtimeified state

Where: CwlCommandBindingTool.exec_before_job() at tools/__init__.py:3757-3829

Current code at line 3765:

input_json = validated_tool_state.input_state

After runtimeify, validated_tool_state is a JobRuntimeToolState. Its .input_state now contains CWL File objects instead of {src: "hda", id: N} refs. Example:

# Before runtimeify (JobInternalToolState):
{"input_file": {"src": "hda", "id": 42}}

# After runtimeify (JobRuntimeToolState):
{"input_file": {
    "class": "File",
    "path": "/galaxy/datasets/000/dataset_42.dat",
    "basename": "reads.fastq",
    "nameroot": "reads",
    "nameext": ".fastq",
    "format": "http://edamontology.org/format_1930",
    "size": 1048576,
    "location": "step_input://0",
    "secondaryFiles": [
        {"class": "File", "path": "/galaxy/datasets/000/dataset_42_files/__secondary_files__/reads.fastq.idx", "basename": "reads.fastq.idx"}
    ]
}}

3b: Cleanup filters in exec_before_job

Lines 3775-3783 currently filter out unset optional files and empty strings:

input_json = {k: v for k, v in input_json.items()
    if not (isinstance(v, dict) and v.get("class") == "File" and v.get("location") == "None")}
input_json = {k: v for k, v in input_json.items() if v != ""}

After runtimeify: The location == "None" check likely won't trigger because runtimeify only produces File objects for HDAs that actually exist (they come from inp_data which only has real datasets). Optional parameters with no value should appear as None in the state, not as File objects with location == "None".

The empty string filter handles optional string params with no value. This should still work -- runtimeify passes non-data parameters through unchanged (VISITOR_NO_REPLACEMENT).

Recommendation: Keep both filters for safety during transition, but add a TODO to remove once the old path is dead.
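A minimal, runnable illustration of what those two filters do (the helper name and sample values are invented):

```python
def filter_input_json(input_json: dict) -> dict:
    # Drop unset optional files that the legacy path encoded as location == "None".
    input_json = {
        k: v for k, v in input_json.items()
        if not (isinstance(v, dict) and v.get("class") == "File" and v.get("location") == "None")
    }
    # Drop optional string params with no value.
    return {k: v for k, v in input_json.items() if v != ""}

sample = {
    "required_file": {"class": "File", "location": "/data/reads.fastq"},
    "optional_file": {"class": "File", "location": "None"},  # legacy unset marker
    "optional_string": "",                                   # unset string
    "threads": 4,
}
filtered = filter_input_json(sample)
```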

3c: JobProxy receives runtimeified input_json

exec_before_job passes input_json to:

cwl_job_proxy = self._cwl_tool_proxy.job_proxy(input_json, output_dict, local_working_directory)

Which calls JobProxy.__init__() (parser.py:332), storing input_json as self._input_dict, then immediately calling self._normalize_job().


Step 4: What _normalize_job() Still Needs to Do vs What Runtimeify Handled

What runtimeify handles:

  • Convert {src: "hda", id: N} to {"class": "File", "path": "...", "basename": "...", ...}
  • Resolve dataset paths via compute_environment.input_path_rewrite()
  • Populate basename, nameroot, nameext, format, size
  • Discover and attach secondary files
  • Resolve CWL format URIs

What _normalize_job() still does (parser.py:376-391):

  1. process.fill_in_defaults() -- fills CWL default values for parameters not provided in the input dict. Runtimeify doesn't know about CWL defaults (they're in the cwltool Process object, not the Galaxy parameter model). Still needed.

  2. visit_class(input_dict, ("File", "Directory"), pathToLoc) -- converts "path" keys to "location" keys. Runtimeify produces File objects with both path and location (the DataInternalJson model has both fields). cwltool expects location for its internal processing. Still needed, but becomes a no-op if location is already set. The pathToLoc callback only acts when location is absent:

    def pathToLoc(p):
        if "location" not in p and "path" in p:
            p["location"] = p["path"]
            del p["path"]

    Since runtimeify sets both path and location, this callback won't fire for runtimeified inputs. It's still needed for files injected by fill_in_defaults (CWL defaults might use path only).

  3. No structural transformation -- _normalize_job doesn't restructure records, arrays, or unions. It assumes the input dict already matches the CWL schema structure. Runtimeify preserves the original structure (it only replaces leaf values).

What _normalize_job does NOT need to change:

The function is already correct for the new path. It receives File objects (from runtimeify) where it previously received File objects (from to_cwl_job/galactic_flavored_to_cwl_job). The only difference is the source of those File objects.

Secondary files in _normalize_job:

visit_class also visits secondary files recursively (they're nested dicts with class: "File"). The pathToLoc callback will convert their path to location if needed. This works correctly with the secondary files we attach in discover_secondary_files.
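A runnable sketch of this normalization, with a simplified stand-in for cwltool's visit_class (the sample File objects are invented):

```python
def path_to_loc(p: dict) -> None:
    # cwltool's normalization: promote "path" to "location" only when location is absent.
    if "location" not in p and "path" in p:
        p["location"] = p["path"]
        del p["path"]

def visit_files(value) -> None:
    # Simplified visit_class: recurse into dicts/lists, applying path_to_loc
    # to every File/Directory dict (including nested secondary files).
    if isinstance(value, dict):
        if value.get("class") in ("File", "Directory"):
            path_to_loc(value)
        for v in value.values():
            visit_files(v)
    elif isinstance(value, list):
        for v in value:
            visit_files(v)

# Runtimeified input: both keys present, so path_to_loc leaves it alone.
runtimeified = {"class": "File", "path": "/data/reads.fastq", "location": "step_input://0"}
# File injected by fill_in_defaults: path only, so it gets promoted to location.
from_default = {"class": "File", "path": "/defaults/ref.fa"}
visit_files({"a": runtimeified, "b": from_default})
```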


Step 5: Comparison with YAML Tool Runtimeify Infrastructure

What we reuse from YAML tools:

  • runtimeify() (convert.py:539-577) -- direct reuse: same function, same visitor
  • visit_input_values() (convert.py, via parameter visitor) -- direct reuse
  • setup_for_runtimeify() (runtime.py:50-123) -- base infrastructure reused; CWL wraps it
  • DataInternalJson (tool_util_models/parameters.py:594-612) -- base model; CWL extends it
  • JobRuntimeToolState (tool_util_models/parameters.py / state.py) -- direct reuse
  • set_basename_and_derived_properties() (cwl/util.py:44-47) -- already shared

What CWL adds on top:

  • CwlDataInternalJson (tool_util_models/parameters.py, new) -- extends DataInternalJson with secondaryFiles
  • CwlSecondaryFileJson (tool_util_models/parameters.py, new) -- model for secondary file entries
  • setup_for_cwl_runtimeify() (tools/cwl_runtime.py, new) -- wraps setup_for_runtimeify with CWL enrichment
  • discover_secondary_files() (tools/cwl_runtime.py, new) -- finds secondary files in HDA extra_files_path

Key structural differences from YAML path:

  1. YAML: UserToolEvaluator.build_param_dict() calls runtimeify, then uses result for do_eval() expression evaluation. CWL: ToolEvaluator.set_compute_environment() calls runtimeify, passes result to exec_before_job(), which feeds it to JobProxy/cwltool.

  2. YAML: Command built at _build_command_line() time via JavaScript expressions against runtimeified state. CWL: Command pre-computed by cwltool at exec_before_job() time, stashed in param_dict["__cwl_command"].

  3. YAML: adapt_dataset returns DataInternalJson (no secondary files). CWL: adapt_dataset returns CwlDataInternalJson (with secondary files, CWL format URIs).

  4. YAML: No post-runtimeify normalization needed. CWL: _normalize_job() still runs fill_in_defaults and pathToLoc after runtimeify.


Step 6: Changes to Existing Code

evaluation.py -- ToolEvaluator.set_compute_environment()

Lines 200-222: Insert runtimeify call in the param_dict_style == "regular" branch, between state reconstruction and execute_tool_hooks:

# After line 220 (validate internal_tool_state):
if internal_tool_state is not None and self.tool.tool_type in CWL_TOOL_TYPES:
    from galaxy.tool_util.parameters.convert import runtimeify
    from galaxy.tools.cwl_runtime import setup_for_cwl_runtimeify

    input_dataset_collections = None  # TODO: wire up from job.io_dicts()
    hda_references, adapt_dataset, adapt_collection = setup_for_cwl_runtimeify(
        self.app, compute_environment, inp_data, input_dataset_collections
    )
    internal_tool_state = runtimeify(
        internal_tool_state, self.tool, adapt_dataset, adapt_collection
    )

Type signature change: execute_tool_hooks() at line 236 accepts Optional[JobInternalToolState]. Widen to Optional[Union[JobInternalToolState, JobRuntimeToolState]]. Same for exec_before_job in tools/__init__.py.

tools/__init__.py -- CwlCommandBindingTool.exec_before_job()

Line 3757: Update type hint from Optional[JobInternalToolState] to Optional[Union[JobInternalToolState, JobRuntimeToolState]].

Lines 3775-3783: Keep the filtering but note it should be mostly unnecessary after runtimeify. Optional CWL inputs with no dataset should appear as None values, not as fake File objects.

No other changes -- the rest of exec_before_job (output_dict construction, JobProxy creation, command extraction, staging, saving) works the same regardless of whether input_json came from runtimeify or from the old path.

tool_util_models/parameters.py

After line 612: Add CwlSecondaryFileJson and CwlDataInternalJson models (see Step 2d).

convert.py -- runtimeify()

No changes to the function itself. The visitor pattern already handles any DataParameterModel leaf. The CWL-specific enrichment is handled by the callback (adapt_dataset), not by runtimeify's logic.

One potential issue: If CWL tool parameters aren't modeled as DataParameterModel, the visitor won't recognize them. See Unresolved Questions.

parser.py -- JobProxy._normalize_job()

No changes needed. fill_in_defaults and pathToLoc work correctly on pre-runtimeified input dicts. The pathToLoc callback is a no-op for entries that already have location set.

parser.py -- ToolProxy.__init__() format stripping

Lines 143-148 strip format from inputs_record_schema fields to prevent cwltool from complaining about missing format in input data. With the new path, we're providing CWL format URIs in the File objects (via hda.cwl_formats). This format stripping may become unnecessary -- but keep it for now since not all inputs may have format URIs, and it's a safe no-op if format URIs are present.


Step 7: New Code

lib/galaxy/tools/cwl_runtime.py (new file)

Contains:

  1. setup_for_cwl_runtimeify(app, compute_environment, input_datasets, input_dataset_collections)

    • Calls setup_for_runtimeify() to get base callbacks
    • Wraps adapt_dataset to add secondary files and CWL format URIs
    • Returns (hda_references, cwl_adapt_dataset, adapt_collection)
  2. discover_secondary_files(hda, compute_environment)

    • Checks {hda.extra_files_path}/__secondary_files__/
    • Lists files, builds secondary file dicts with class, path, basename
    • Returns List[dict] (empty if no secondary files)
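A sketch of discover_secondary_files as a plain directory walk. The signature here takes the already-resolved extra-files path directly, sidestepping the compute_environment path-rewriting question raised under Unresolved Questions; the real function would accept the HDA and compute environment as described above.

```python
import os
import tempfile
from typing import List

def discover_secondary_files(extra_files_path: str) -> List[dict]:
    # List entries under __secondary_files__/ and build CWL-style dicts,
    # classifying each entry as File or Directory.
    secondary_dir = os.path.join(extra_files_path, "__secondary_files__")
    if not os.path.isdir(secondary_dir):
        return []
    entries = []
    for name in sorted(os.listdir(secondary_dir)):
        full = os.path.join(secondary_dir, name)
        entries.append({
            "class": "Directory" if os.path.isdir(full) else "File",
            "path": full,
            "basename": name,
        })
    return entries

# Demonstrate against a throwaway directory layout.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "__secondary_files__"))
open(os.path.join(tmp, "__secondary_files__", "reads.fastq.idx"), "w").close()
secondary = discover_secondary_files(tmp)
```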

lib/galaxy/tool_util_models/parameters.py (additions)

  1. CwlSecondaryFileJson -- model for secondary file entries
  2. CwlDataInternalJson(DataInternalJson) -- extends with secondaryFiles: Optional[List[CwlSecondaryFileJson]]
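The intended shape of these models, sketched with dataclasses (the real additions would be pydantic models extending Galaxy's DataInternalJson; the field sets below are abbreviated assumptions, and `class_` stands in for the serialized "class" key):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CwlSecondaryFileJson:
    # One secondary file entry attached to a primary File.
    class_: str  # "File" or "Directory"
    path: str
    basename: str

@dataclass
class CwlDataInternalJson:
    # Primary File object; base fields abbreviated from DataInternalJson.
    class_: str  # "File"
    path: str
    location: str
    basename: str
    nameroot: str
    nameext: str
    size: Optional[int] = None
    format: Optional[str] = None
    secondaryFiles: Optional[List[CwlSecondaryFileJson]] = None

f = CwlDataInternalJson(
    class_="File",
    path="/galaxy/datasets/000/dataset_42.dat",
    location="step_input://0",
    basename="reads.fastq",
    nameroot="reads",
    nameext=".fastq",
    secondaryFiles=[
        CwlSecondaryFileJson(
            "File",
            "/galaxy/datasets/000/dataset_42_files/__secondary_files__/reads.fastq.idx",
            "reads.fastq.idx",
        )
    ],
)
```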

Summary: Data Flow

job.tool_state (JSON dict with {src: "hda", id: N} refs)
    |
    | JobInternalToolState(job.tool_state)
    | .validate(tool, ...)
    v
JobInternalToolState
    |
    | runtimeify(state, tool, cwl_adapt_dataset, adapt_collection)
    |   cwl_adapt_dataset:
    |     - resolves HDA path via compute_environment.input_path_rewrite()
    |     - sets basename, nameroot, nameext, format (EDAM URI), size
    |     - calls discover_secondary_files() for __secondary_files__
    |     - returns CwlDataInternalJson
    v
JobRuntimeToolState
    |
    | exec_before_job(validated_tool_state=runtime_state)
    | input_json = runtime_state.input_state
    v
input_json (dict with CWL File objects, secondary files attached)
    |
    | Filter: remove location=="None", empty strings
    |
    | JobProxy(tool_proxy, input_json, output_dict, job_dir)
    v
JobProxy._normalize_job()
    | fill_in_defaults() -- inject CWL defaults for missing params
    | visit_class(pathToLoc) -- convert path->location (mostly no-op)
    v
JobProxy._ensure_cwl_job_initialized()
    | Process.job(input_dict, callback, runtime_context) -> cwltool Job
    v
cwltool Job
    | .command_line -> list of args (with $GALAXY_SLOTS unsentineling)
    | .stdin, .stdout, .stderr
    | .environment
    | .pathmapper -> for stage_files()
    v
param_dict["__cwl_command"] = assembled command string
param_dict["__cwl_command_state"] = {args, stdin, stdout, stderr, env}
    |
    | ToolEvaluator.build()
    | _build_command_line() -> uses __cwl_command verbatim
    | _build_environment_variables() -> reads from __cwl_command_state.env
    v
Command string + environment -> job script execution

Testing Strategy

  1. Unit test discover_secondary_files(): Create mock HDA with extra_files_path containing __secondary_files__/ directory. Verify correct File/Directory classification and path resolution.

  2. Unit test CWL adapt_dataset: Given an HDA with secondary files and CWL format, verify CwlDataInternalJson has correct secondaryFiles list and EDAM format URI.

  3. Unit test runtimeify with CWL parameters: Create a CWL tool parameter model, a JobInternalToolState with HDA refs, run runtimeify, verify output has File objects with correct structure.

  4. Integration test: Run a CWL tool with secondary files (e.g., BAM + BAI) through the tool request API. Verify:

    • job.tool_state persisted correctly
    • runtimeify produces File objects with secondaryFiles
    • cwltool receives correct input dict
    • Command line generated correctly
    • Output collection works
  5. Regression test: Existing CWL conformance tests should pass with the new path.
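Test 3's core property -- leaf replacement preserves the surrounding structure -- can be expressed without Galaxy at all. The toy visitor and callback below are stand-ins for runtimeify and adapt_dataset, not the real implementations:

```python
def toy_runtimeify(state, adapt_dataset):
    # Replace {src: "hda", id: N} leaves via the callback,
    # leaving records, arrays, and scalar parameters untouched.
    if isinstance(state, dict):
        if state.get("src") == "hda":
            return adapt_dataset(state["id"])
        return {k: toy_runtimeify(v, adapt_dataset) for k, v in state.items()}
    if isinstance(state, list):
        return [toy_runtimeify(v, adapt_dataset) for v in state]
    return state

def fake_adapt(dataset_id):
    return {"class": "File", "path": f"/galaxy/datasets/dataset_{dataset_id}.dat"}

before = {
    "pair": {"forward": {"src": "hda", "id": 1}, "reverse": {"src": "hda", "id": 2}},
    "min_len": 20,
}
after = toy_runtimeify(before, fake_adapt)
```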


Unresolved Questions

  1. CWL parameters as DataParameterModel? Does runtimeify's visit_input_values recognize CWL file parameters? If CWL uses CwlInputParameter instead of DataParameterModel, the visitor won't call adapt_dataset. Need to trace actual parameter model types for a CWL tool.

  2. Directory inputs? CWL directories are stored as tar archives in Galaxy (ext directory). Legacy code in representation.py:198+ untars into _inputs dir. Where does this happen in the new path? adapt_dataset produces class: "File" but directories need class: "Directory" with listing. Possibly needs a separate adapt_directory or a type check in adapt_dataset.

  3. Collection inputs? CWL arrays/records map to Galaxy collections. runtimeify has adapt_collection support but raises NotImplementedError for some cases in the base path. Do CWL tools ever receive collection inputs through the tool request API? If so, how are they modeled?

  4. Format stripping still needed? ToolProxy.__init__ strips format from inputs_record_schema (parser.py:143-148) so cwltool won't reject inputs missing format. If we now provide format URIs in File objects, does cwltool validate them against the schema format? If so, the stripping may cause a mismatch (schema says no format, input provides one). Needs testing.

  5. compute_environment for secondary file paths? discover_secondary_files uses compute_environment.input_path_rewrite() for path resolution. Does input_path_rewrite work on arbitrary paths (like extra_files_path subdirectories), or only on HDA file names? If the latter, secondary file paths may need a different rewriting strategy.

  6. Expression tools? ExpressionTools have no command line -- they run JS during output collection. The current path returns ["true"] as command. Does runtimeify affect ExpressionTool execution at all? The input_json still needs to flow to JobProxy.save_job() for output collection to work.

  7. Symlink staging for secondary files? Legacy dataset_wrapper_to_file_json symlinks primary + secondary files into a shared _inputs dir so basename-relative references work. With runtimeify, secondary file paths point to extra_files_path/__secondary_files__/. Will cwltool's PathMapper.stage_files() handle staging them correctly, or do we need explicit symlinking?

  8. input_dataset_collections wiring? The runtimeify call in set_compute_environment needs input_dataset_collections. job.io_dicts() returns (inp_data, out_data, out_collections) at line 185 but not input collections directly. Need to verify how job.input_dataset_collections maps to the dict format setup_for_runtimeify expects.

CWL Validated Runtime Plan: Workflow Runtime Implications

How CWL workflow execution interacts with the tool runtime code, what legacy code the workflow runner depends on, and what changes are needed to make the validated runtime plan work for workflows.

Branch: cwl_on_tool_request_api_2


Table of Contents

  1. How CWL Workflows Execute in Galaxy
  2. Workflow Dependencies on Tool Runtime Code
  3. What Legacy Code the Workflow Runner Needs
  4. Impact of the Validated Runtime Plan on Workflows
  5. Unresolved Questions

How CWL Workflows Execute in Galaxy

Import: CWL Workflow -> Galaxy Workflow

CWL workflows enter Galaxy through the workflow manager (lib/galaxy/managers/workflows.py:655-675):

CWL .cwl file
    | workflow_proxy(workflow_path)       -- parser.py:571 WorkflowProxy
    |
    | wf_proxy.tool_reference_proxies()  -- extract all tools recursively
    | for each tool_proxy:
    |   dynamic_tool_manager.create_tool_from_proxy(tool_proxy)  -- register as dynamic tool w/ UUID
    |
    | wf_proxy.to_dict()                 -- convert to Galaxy workflow dict format
    v
Galaxy workflow dict (steps, connections, tool_uuids)
    | standard Galaxy workflow import
    v
Stored Galaxy Workflow with WorkflowSteps

The WorkflowProxy.to_dict() (parser.py:677-699) converts CWL constructs into Galaxy equivalents:

  • CWL workflow inputs become Galaxy data_input, data_collection_input, or parameter_input steps depending on type (parser.py:714-755)
  • CWL tool steps become Galaxy tool steps with tool_uuid pointing to the registered dynamic CWL tool (parser.py:1083-1127)
  • CWL scatter is encoded on step input connections as scatter_type (parser.py:1007-1010)
  • CWL valueFrom expressions are stored as value_from on step input connections (parser.py:1011-1013)
  • CWL subworkflows become Galaxy subworkflow steps (parser.py:1114-1163)

Invocation: Scheduling and Step Execution

Once imported, CWL workflows run through Galaxy's standard workflow scheduler:

POST /api/workflows (invoke)
    | queue_invoke()                     -- run.py:128
    | workflow_request_to_run_config()   -- converts inputs
    v
WorkflowInvoker(trans, workflow, run_config)
    | .invoke()                          -- run.py:201
    | for each step:
    |   _invoke_step(invocation_step)
    |     -> step.module.execute(trans, progress, invocation_step)
    v
    [ToolModule.execute() for CWL tool steps]

Tool Step Execution: ToolModule.execute()

ToolModule.execute() (modules.py:2460-2746) is the critical path. For CWL tool steps:

Phase 1: State preparation (lines 2476-2509)

  • Computes runtime state from step state + step_updates
  • Builds execution_state with tool_inputs from tool.inputs
  • Determines collection_info for scatter (collection mapping)

Phase 2: Input connection wiring (lines 2512-2612)

  • visit_input_values(tool_inputs, execution_state.inputs, callback) walks Galaxy's parsed parameter tree
  • The callback function resolves each input by looking up progress.replacement_for_input() which follows input connections to upstream step outputs
  • For FieldTypeToolParameter inputs (CWL's catch-all), the callback wraps values as {"src": "hda", "value": hda_object} or {"src": "hdca", "value": hdca_object} or {"src": "json", "value": scalar} (lines 2591-2597)

Phase 3: valueFrom expression evaluation (lines 2635-2663)

  • evaluate_value_from_expressions() (line 2426) converts all step state to CWL format via to_cwl() (modules.py:152-226)
  • Evaluates CWL JavaScript expressions via do_eval() against the CWL-format state
  • from_cwl() (modules.py:229-242) converts results back to Galaxy objects
  • Results applied via expression_callback which handles FieldTypeToolParameter specially (line 2647)

Phase 4: Job submission (lines 2673-2699)

  • Creates MappingParameters(tool_state.inputs, param_combinations) -- only two args, no validated_param_combinations
  • Calls execute() from tools/execute.py
  • This goes through handle_single_execution() -> _execute() which creates the Job
  • Step outputs (HDAs/HDCAs) flow back to progress.set_step_outputs()

Data Flow Between Steps

Between workflow steps, data flows as Galaxy model objects (HDAs and HDCAs), NOT as CWL File objects:

Step A outputs: {output_name: HDA}
    | stored in progress.outputs[step_id]
    |
Step B execution:
    | callback asks progress.replacement_for_input()
    | which calls progress.replacement_for_connection()     -- run.py:615
    | which returns the HDA from progress.outputs[step_id]
    |
    | If FieldTypeToolParameter: wraps as {"src": "hda", "value": HDA}
    | If DataToolParameter: passes HDA directly
    v
execution_state.inputs = {input_name: HDA or wrapped_value}

The to_cwl() function (modules.py:152-226) converts Galaxy objects to CWL-format dicts only for valueFrom/when expression evaluation. It is NOT used for the main tool execution path. For expression evaluation:

  • HDA -> {"class": "File", "location": "step_input://N", "format": ext, "path": file_path, "basename": name, ...} (lines 186-195)
  • expression.json HDA -> the parsed JSON content (lines 181-183)
  • DatasetCollection (list) -> array of CWL values (lines 197-201)
  • DatasetCollection (record) -> dict of CWL values (lines 203-212)

The from_cwl() function (modules.py:229-242) converts back:

  • CWL File dict with class: "File" -> calls progress.raw_to_galaxy(value) which calls raw_to_galaxy() from basic.py:2818
  • CWL File dict with step_input://N location -> looks up HDA from hda_references list (line 238)
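The shape of the to_cwl() direction (not Galaxy's actual implementation, which operates on real HDA and DatasetCollection model objects) can be mimicked with plain dicts; the "kind" tagging below is an invented stand-in for isinstance checks on model classes:

```python
def toy_to_cwl(value, step_index=0):
    # Toy HDA/collection -> CWL conversion, used only for expression evaluation.
    if not isinstance(value, dict):
        return value
    kind = value.get("kind")
    if kind == "hda":
        return {
            "class": "File",
            "location": f"step_input://{step_index}",
            "path": value["path"],
            "basename": value["name"],
        }
    if kind == "list":
        return [toy_to_cwl(e, step_index) for e in value["elements"]]
    if kind == "record":
        return {name: toy_to_cwl(e, step_index) for name, e in value["elements"].items()}
    return value

hda = {"kind": "hda", "path": "/data/reads.fastq", "name": "reads.fastq"}
record = {"kind": "record", "elements": {"reads": hda}}
cwl_value = toy_to_cwl(record)
```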

Where raw_to_galaxy() Is Called

raw_to_galaxy() (basic.py:2818-2904) creates a new deferred HDA or HDCA from a CWL dict. It's used in the workflow runtime at:

  1. WorkflowProgress.raw_to_galaxy() (run.py:947-948) -- wraps raw_to_galaxy(trans.app, trans.history, value)
  2. from_cwl() (modules.py:236) -- for File results from valueFrom expressions
  3. set_outputs_for_input() (run.py:762-764) -- for workflow input steps that receive CWL File dicts
  4. replacement_for_input() (run.py:590,612) -- for step inputs with value_from defaults
  5. compute_runtime_state() callback (modules.py:488) -- for step inputs with default values
  6. expression_callback in ToolModule.execute() (modules.py:2654-2656) -- for CWL File defaults not handled elsewhere
  7. FieldTypeToolParameter.from_json() (basic.py:2924-2925) -- when deserializing CWL File objects in parameter values

CWL Scatter

CWL scatter is mapped to Galaxy's implicit collection mapping at import time:

  1. At import: InputProxy.to_dict() sets scatter_type on step input connections (parser.py:1007-1010). Scatter inputs get "dotproduct" (or the explicit scatterMethod). Non-scatter inputs get "disabled".

  2. At runtime: _find_collections_to_match() (modules.py:591-640) reads step_input.scatter_type and builds CollectionsToMatch. When scatter_type == "disabled", the input is skipped for collection matching (lines 602-606). Collection inputs with scatter_type == "dotproduct" are matched using Galaxy's standard collection mapping infrastructure.

  3. The mapping: CWL scatter over an array input corresponds to Galaxy iterating over a list collection. The workflow runner creates implicit collections (mapped-over outputs) via execution_tracker.implicit_collections.

  4. Limitation: Only dotproduct and disabled are asserted (line 602). flat_crossproduct is defined in the parser (parser.py:1008) but not handled by the workflow runner.

valueFrom Expressions

CWL valueFrom JavaScript expressions are evaluated at workflow scheduling time, NOT at tool execution time.

Where defined: InputProxy.to_dict() stores value_from on step input connections (parser.py:1011-1013). These are persisted on WorkflowStepInput.value_from in the database.

Where evaluated: ToolModule.evaluate_value_from_expressions() (modules.py:2426-2458) and the top-level evaluate_value_from_expressions() (modules.py:245-298) for when expressions.

How evaluated:

  1. All step state (both execution_state inputs and extra_step_state) converted to CWL format via to_cwl() (lines 2443-2447)
  2. Each value_from expression evaluated by do_eval(value_from, step_state, context=step_state[key]) (lines 2450-2453)
  3. Results converted back via from_cwl() (line 2455)
  4. Applied to execution_state.inputs via expression_callback (lines 2642-2662)

Dependency on to_cwl(): The expression evaluation depends on to_cwl() to convert Galaxy objects (HDAs, collections) to CWL-compatible dicts that JavaScript expressions can operate on. This is a modules.py-local function, NOT part of representation.py.

Dependency on from_cwl(): Results are converted back via from_cwl(), which depends on raw_to_galaxy() for File results. This creates deferred HDAs.
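The round-trip can be sketched with a stub in place of do_eval's JavaScript sandbox. Everything here is a mock: the helper, the stub evaluator, and the sample state are invented for illustration.

```python
def evaluate_value_from(step_state_cwl: dict, value_from_exprs: dict, evaluator) -> dict:
    # Toy round-trip: evaluate each value_from expression against the
    # CWL-format step state; "self" (context) is the input being rewritten.
    results = dict(step_state_cwl)
    for key, expression in value_from_exprs.items():
        results[key] = evaluator(expression, step_state_cwl, step_state_cwl.get(key))
    return results

def stub_do_eval(expression, inputs, context):
    # Stand-in for do_eval(), which would run CWL JavaScript in a sandbox.
    if expression == "$(self.basename)":
        return context["basename"]
    raise NotImplementedError(expression)

state = {"input_file": {"class": "File", "basename": "reads.fastq", "path": "/data/reads.fastq"}}
out = evaluate_value_from(state, {"input_file": "$(self.basename)"}, stub_do_eval)
```

In the real path, a File result from the evaluator would then go through from_cwl() and raw_to_galaxy() to become a deferred HDA.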


Workflow Dependencies on Tool Runtime Code

What the Workflow Runner USES

  • to_cwl() (modules.py:152) -- converts Galaxy objects to CWL for valueFrom/when expressions
  • from_cwl() (modules.py:229) -- converts CWL expression results back to Galaxy objects
  • raw_to_galaxy() (basic.py:2818) -- creates deferred HDAs from CWL File dicts
  • FieldTypeToolParameter (basic.py:2907) -- CWL catch-all parameter type, used for wrapping values in the visit_input_values callback
  • do_eval() (expressions/evaluation.py:21) -- evaluates CWL JavaScript expressions
  • set_basename_and_derived_properties() (cwl/util.py:44) -- used indirectly via to_cwl(); sets basename/nameroot/nameext on File objects
  • to_galaxy_parameters() (representation.py:491) -- used only indirectly: called by CwlTool.inputs_from_dict() for inputs_representation="cwl" API submissions
  • FieldTypeToolParameter.from_json() (basic.py:2915) -- used when the workflow populates step state from the API

What the Workflow Runner DOES NOT USE

  • to_cwl_job() (representation.py:386) -- only called from CwlTool.param_dict_to_cwl_inputs(), NOT from workflow code
  • galactic_flavored_to_cwl_job() (representation.py:286) -- only called from GalacticCwlTool.param_dict_to_cwl_inputs()
  • dataset_wrapper_to_file_json() (representation.py:155) -- only called within to_cwl_job() and galactic_flavored_to_cwl_job()
  • galactic_job_json() (cwl/util.py:153) -- client-side staging only (test framework, planemo)
  • output_to_cwl_json() (cwl/util.py:513) -- test infrastructure only (populators.py)
  • JobProxy (parser.py:329) -- used by exec_before_job, not by the workflow runner
  • runtime_actions.py (cwl/runtime_actions.py) -- post-execution output collection script, not the workflow runner

The Key Distinction

The workflow runner operates at a different layer than tool execution:

  • Workflow runner converts between Galaxy model objects (HDAs, HDCAs, DatasetCollections) and CWL-format dicts for expression evaluation. It uses to_cwl() (modules.py) and from_cwl() (modules.py) -- these are workflow-specific functions, NOT representation.py functions.

  • Tool execution converts between Galaxy parameter dicts (DatasetWrappers, wrapped values) and CWL job JSON (File objects with paths). It uses to_cwl_job() / galactic_flavored_to_cwl_job() (representation.py) -- this is the legacy code the plan eliminates.

These are separate code paths that both happen to produce CWL-format dicts.


What Legacy Code the Workflow Runner Needs

Code That Blocks Cleanup

1. FieldTypeToolParameter (basic.py:2907-2970)

Used by workflow runner at:

  • modules.py:2567 -- isinstance(input, FieldTypeToolParameter) to classify inputs as "data"
  • modules.py:2591-2597 -- wrapping values for FieldTypeToolParameter with {"src": "hda/hdca/json", "value": ...}
  • modules.py:2647 -- expression_callback wraps expression results for FieldTypeToolParameter

Why it blocks cleanup: CWL tool inputs are parsed as FieldTypeToolParameter instances via the CWL tool parser. The workflow step execution callback in ToolModule.execute() uses isinstance(input, FieldTypeToolParameter) to determine how to pass values. If we remove FieldTypeToolParameter, the workflow runner needs a different way to identify CWL-typed inputs.

Assessment: FieldTypeToolParameter CAN be removed from the tool execution path (the validated runtime plan bypasses it), but the workflow runner's visit_input_values callbacks still need to handle CWL inputs specially. With has_galaxy_inputs=False, the tool will have no Galaxy-style inputs, so tool.inputs will be empty (or different), and visit_input_values won't find any parameters to visit. This changes how the workflow runner wires inputs -- see Impact section.

2. raw_to_galaxy() (basic.py:2818-2904)

Used by workflow runner at: 7 locations (see list above).

Why it blocks cleanup: Creates deferred HDAs from CWL File dict objects. This is needed when:

  • valueFrom expressions produce File objects that need to become HDAs
  • Step inputs have CWL File defaults
  • Workflow inputs are specified as CWL File dicts

Assessment: This function is NOT part of representation.py and is NOT part of the legacy parameter hack. It's a general utility for creating HDAs from CWL-format dicts. It should survive cleanup. However, it lives in basic.py alongside FieldTypeToolParameter -- if basic.py gets major CWL surgery, raw_to_galaxy() needs to be preserved (perhaps moved to a more appropriate module).

3. to_cwl() / from_cwl() (modules.py:152-242)

These are workflow-specific functions in modules.py, NOT in representation.py. They convert Galaxy model objects (HDAs, DatasetCollections) to/from CWL-format dicts for JavaScript expression evaluation.

Why they exist: CWL valueFrom and when expressions operate on CWL-format data. The workflow runner must convert Galaxy objects to CWL format before evaluating expressions, and convert results back.

Assessment: These are independent of the legacy parameter hack and will continue to be needed regardless of the validated runtime plan.

4. to_galaxy_parameters() (representation.py:491-575)

Called from: CwlTool.inputs_from_dict() (tools/__init__.py:3877) when inputs_representation == "cwl".

Used by workflow runner: Not directly. But inputs_from_dict() is called from ToolsService (webapps/galaxy/services/tools.py:327) when a tool is invoked via the old /api/tools endpoint with CWL-format inputs. This is not the workflow path.

Assessment: With the tool request API, to_galaxy_parameters() becomes dead code. The workflow runner never calls it.

5. Conditional / Repeat parameter types for CWL

The legacy representation layer creates Galaxy Conditional and Repeat inputs to model CWL union types and arrays. These are referenced in:

  • representation.py:to_cwl_job() -- walks conditionals and repeats to reverse-engineer CWL types
  • representation.py:to_galaxy_parameters() -- creates conditional/repeat request format

Assessment: These are used ONLY in the tool execution path (to_cwl_job) and the API input conversion path (to_galaxy_parameters). The workflow runner's visit_input_values callback encounters them in tool.inputs, but after the validated runtime plan, CWL tools won't have these Galaxy-style inputs at all.


Impact of the Validated Runtime Plan on Workflows

The Critical Bug: validated_tool_state Is None for Workflow Steps

This is the most important finding.

The current WIP code at tools/__init__.py:3765:

input_json = validated_tool_state.input_state

This will crash (AttributeError: 'NoneType') when a CWL tool runs within a workflow because:

  1. ToolModule.execute() creates MappingParameters(tool_state.inputs, param_combinations) with only two args (modules.py:2674)
  2. validated_param_combinations defaults to None
  3. In execute_single_job(), execution_slice.validated_param_combination is None
  4. In _execute(), job.tool_state is NOT set because the validated state is None (execute.py:254-256)
  5. In ToolEvaluator.set_compute_environment(), job.tool_state is None, so internal_tool_state is None (evaluation.py:217-220)
  6. None is passed to exec_before_job(validated_tool_state=None) (evaluation.py:239)
  7. Line 3765 crashes: None.input_state

The legacy fallback that would have caught this was:

input_json = self.param_dict_to_cwl_inputs(param_dict, local_working_directory)

This has been removed in the WIP commits, replaced by the direct validated_tool_state.input_state access.

What Needs to Change for Workflows

There are two approaches:

Approach A: Make the Workflow Runner Set validated_param_combinations

The workflow runner would need to:

  1. Build a JobInternalToolState for each CWL tool step (converting the Galaxy object-rich execution_state.inputs into a serializable dict with {src: "hda", id: N} references)
  2. Pass these as validated_param_combinations in MappingParameters

This is harder than it sounds because execution_state.inputs for CWL tools contains wrapped values from FieldTypeToolParameter ({"src": "hda", "value": <HDA>}, {"src": "json", "value": 42}), not the simple {src: "hda", id: N} format that JobInternalToolState expects.

Implications: The workflow step execution callback (ToolModule.execute()) would need restructuring for CWL tools. Instead of visiting tool.inputs (which uses the legacy Galaxy parameter tree with FieldTypeToolParameter/Conditional/Repeat), it would need to build a CWL-format input dict directly from the step connections.
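The conversion Approach A requires can be sketched against the wrapped-value shapes described above. FakeHda and the helper name are assumptions; real code would handle HDCAs, nested structures, and collection references too.

```python
class FakeHda:
    # Stand-in for a Galaxy HDA model object.
    def __init__(self, id):
        self.id = id

def unwrap_to_internal_ref(value):
    # Convert a FieldTypeToolParameter-wrapped value into the serializable
    # form JobInternalToolState expects: {src: "hda", id: N} or a raw scalar.
    if isinstance(value, dict) and "src" in value:
        if value["src"] in ("hda", "hdca"):
            return {"src": value["src"], "id": value["value"].id}
        if value["src"] == "json":
            return value["value"]
    return value

internal_ref = unwrap_to_internal_ref({"src": "hda", "value": FakeHda(42)})
scalar = unwrap_to_internal_ref({"src": "json", "value": 42})
```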

Approach B: Dual Path in exec_before_job

Keep both paths:

if validated_tool_state is not None:
    input_json = validated_tool_state.input_state
else:
    # Legacy workflow path: reverse-engineer CWL from Galaxy param_dict
    input_json = self.param_dict_to_cwl_inputs(param_dict, local_working_directory)

This preserves the legacy code for workflow execution while allowing the tool request API to use the clean path.

Implications: param_dict_to_cwl_inputs(), to_cwl_job(), and galactic_flavored_to_cwl_job() must be retained. FieldTypeToolParameter, Conditional, Repeat CWL-specific parameter types must be retained. The legacy representation.py code stays alive.

Approach C: Bypass the Legacy Parameter System for CWL Workflow Steps

The most ambitious approach: change ToolModule.execute() to handle CWL tools differently. Instead of visit_input_values(tool.inputs, execution_state.inputs, callback), build the CWL input dict directly from step connections:

if tool.tool_type in CWL_TOOL_TYPES:
    # Build CWL input dict directly from connections
    cwl_input_dict = {}
    for input_name, connections in step.input_connections_by_name.items():
        replacement = progress.replacement_for_connection(connections[0])
        cwl_input_dict[input_name] = to_cwl_reference(replacement)  # {src: "hda", id: N}

    # Handle valueFrom expressions (still needs to_cwl/from_cwl for JS eval)
    ...

    internal_state = JobInternalToolState(cwl_input_dict)
    # Pass to execute with validated_param_combinations

This would allow removing FieldTypeToolParameter entirely and all the conditional/repeat CWL hacks.

How Runtimeify Affects Workflow Step Execution

The runtimeify plan (converting {src: "hda", id: N} references to CWL File objects with paths) works the same for workflow-invoked CWL tools as for directly-invoked ones -- if we can get a JobInternalToolState set on the job. The runtimeify step happens in ToolEvaluator.set_compute_environment() (evaluation.py:200-222), which runs for all jobs regardless of how they were submitted.
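The runtimeify transition can be illustrated with a minimal sketch. This is not Galaxy's implementation; resolve_path is a hypothetical stand-in for the object-store lookup, and only the flat HDA-to-File case is shown.

```python
# Illustrative sketch of the runtimeify step: dataset references in the
# job-internal state become CWL File objects with real filesystem paths,
# while scalar parameters pass through unchanged.
def runtimeify(input_state, resolve_path):
    runtime_state = {}
    for name, value in input_state.items():
        if isinstance(value, dict) and value.get("src") == "hda":
            runtime_state[name] = {
                "class": "File",
                "path": resolve_path(value["id"]),
            }
        else:
            runtime_state[name] = value
    return runtime_state
```

Because this is driven entirely by the stored state, it behaves identically whether the state was built by the tool request API or by a workflow step.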

The key question is: who builds the JobInternalToolState for workflow-invoked CWL tools?

What About to_cwl() / from_cwl() in modules.py?

These functions are used for valueFrom/when expression evaluation and are independent of the tool execution path. They:

  • Convert Galaxy model objects to CWL-format dicts for JavaScript
  • Convert JavaScript results back to Galaxy model objects

The validated runtime plan does NOT affect these. Even with Approach C, valueFrom expressions still need to_cwl() to convert HDAs to CWL File objects for JavaScript evaluation. However, to_cwl() in modules.py is much simpler than to_cwl_job() in representation.py:

  • It works with Galaxy model objects (HDAs), not DatasetWrappers
  • It doesn't handle conditionals, repeats, or the _cwl__type_/_cwl__value_ encoding
  • It doesn't need secondary files, checksums, or full CWL compliance

Impact on CWL Scatter

CWL scatter maps to Galaxy's collection-based parallelism. The scatter handling happens at the workflow runner level (_find_collections_to_match() in modules.py:591-640) and in Galaxy's standard collection mapping infrastructure. This is independent of the tool runtime code.

The validated runtime plan does NOT affect scatter. The scatter produces multiple execution slices (param_combinations), each of which becomes a separate job. Each job goes through the same exec_before_job path.

However, if Approach C is taken (bypassing legacy parameter system for CWL workflow steps), scatter handling would need to work with the new input dict format instead of the legacy execution_state.inputs.


Unresolved Questions

  1. Which approach for workflow support? A (make workflow runner set validated state), B (dual path in exec_before_job), or C (bypass legacy param system for CWL workflow steps)? B is safest/fastest but prevents cleanup. C is cleanest but most work.

  2. Can we ship tool-only support first? The tool request API path works for direct CWL tool invocation. Can workflows remain on the legacy path (Approach B) while we iterate toward Approach A or C?

  3. How do FieldTypeToolParameter instances get created for CWL tools with has_galaxy_inputs=False? If CWL tools no longer have Galaxy-style inputs (because the new commits stripped InputInstance to just name/label/description), does tool.inputs contain FieldTypeToolParameter instances at all? If not, the visit_input_values callback in ToolModule.execute() won't match any inputs, and the wiring will be broken for CWL workflow steps.

  4. What parameter model does CwlToolSource produce now? The parse_input_pages() changes in the WIP commits affect what tool.inputs looks like for CWL tools. If tool.inputs is empty, the entire visit_input_values loop in ToolModule.execute() becomes a no-op, and inputs won't be wired correctly for workflow execution.

  5. Does to_galaxy_parameters() need to survive? It's called from inputs_from_dict() which handles inputs_representation="cwl" on the old /api/tools endpoint. If we drop legacy API support, this can go. But do workflows use it? (Answer: no, the workflow runner builds state differently.)

  6. How do workflow step defaults interact with the new path? CWL step input defaults (parser.py:1014-1015) are stored on WorkflowStepInput.default_value. These can be CWL File dicts. The workflow runner calls raw_to_galaxy() to convert them to HDAs. This is independent of the tool runtime, but raw_to_galaxy lives in basic.py near the code we might want to clean up.

  7. Can raw_to_galaxy() move out of basic.py? It's a general utility for creating deferred HDAs from CWL dicts. It doesn't depend on FieldTypeToolParameter. Moving it to a CWL-specific module (e.g., tools/cwl_runtime.py) would decouple it from basic.py cleanup.

  8. How do expression.json outputs flow between CWL workflow steps? CWL ExpressionTools produce expression.json datasets. Downstream steps that receive these check hda.ext == "expression.json" and read the JSON content instead of using the HDA as a file (modules.py:2531-2548, modules.py:2571-2589). This is handled in the workflow runner, not in tool runtime code. Does it still work with the new parameter model?

  9. What about flat_crossproduct scatter? The parser generates it (parser.py:1008) but the workflow runner only asserts dotproduct or disabled (modules.py:602). Is this a pre-existing gap or something that needs addressing for the runtime plan?

  10. Test coverage for CWL workflows? test_workflows_cwl.py has tests for simple workflows, multi-step, scatter, and collection outputs. These all go through the standard invoke_workflow_and_assert_ok path which uses the workflow runner, NOT the tool request API. These tests will catch workflow regressions.

CWL Validated Runtime: Workflow Adaptation Plan (Approach C)

Bypass the legacy Galaxy parameter system for CWL workflow steps entirely. Build CWL input dicts directly from step connections, set validated_param_combinations on MappingParameters, and converge tool-invoked and workflow-invoked CWL tools onto the same validated runtime path.

Branch: cwl_on_tool_request_api_2


What to Do During CWL_VALIDATED_RUNTIME_PLAN Implementation to Set This Up

These actions during the tool-only runtime plan make the workflow adaptation easier and more likely to succeed. None of them breaks tool-only functionality.

1. Keep exec_before_job dual-path temporarily

if validated_tool_state is not None:
    input_json = validated_tool_state.input_state
else:
    input_json = self.param_dict_to_cwl_inputs(param_dict, local_working_directory)

This ensures existing CWL workflow tests keep passing while you iterate on tool-only runtimeify. The fallback path fires for workflow-invoked CWL tools (where validated_tool_state is None). Remove it only after Approach C lands.

2. Ensure CwlFileParameterModel is recognized by runtimeify

The runtimeify visitor checks isinstance(parameter, DataParameterModel). CWL file parameters use CwlFileParameterModel which extends BaseGalaxyToolParameterModelDefinition, not DataParameterModel. The visitor won't recognize them without a fix.

During tool-only plan: Either extend the visitor's to_runtime_callback to also check for CwlFileParameterModel / CwlDirectoryParameterModel, or make those types inherit from (or register as) DataParameterModel. This is needed for tool-only runtimeify too, so it's not premature work.

3. Design the CWL input dict format to be workflow-friendly

When building the JobInternalToolState for tool-only invocations, ensure the dict format is:

{
    "file_input": {"src": "hda", "id": 42},
    "string_param": "hello",
    "int_param": 7,
    "optional_file": None,    # absent optional
}

This same format is what the workflow adaptation will build from step connections. Don't embed Galaxy objects or wrapper dicts. Keep it serializable JSON with dataset refs as {src, id}.
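A cheap guard can enforce the "serializable JSON only" invariant during development. This is a hedged sketch, not existing Galaxy code; assert_plain_json is a hypothetical helper name.

```python
import json

# Sketch: fail fast if a CWL input dict accidentally embeds a Galaxy model
# object or wrapper dict instead of plain JSON with {src, id} references.
def assert_plain_json(cwl_input_dict):
    try:
        json.dumps(cwl_input_dict)
    except TypeError as e:
        raise ValueError(f"input dict contains non-JSON values: {e}")
```

Calling this right after building the dict in tests (or behind an assertion flag) catches format regressions long before they reach JobInternalToolState validation.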

4. Validate runtimeify works without collection inputs initially

The workflow adaptation needs collection inputs eventually (for scatter), but the tool-only plan can punt on adapt_collection for CWL. Just ensure it doesn't crash -- raise NotImplementedError with a clear message. Scatter handling is Step 5 of this plan.

5. Move raw_to_galaxy() out of basic.py

raw_to_galaxy() creates deferred HDAs from CWL File dicts. The workflow runner needs it for valueFrom expression results. Move it to tools/cwl_runtime.py (the new file from the tool-only plan) during that work so basic.py cleanup is unblocked later.

6. Add workflow CWL tests to CI monitoring

Run test_workflows_cwl.py in CI during the tool-only work. These tests use the legacy workflow path and should keep passing through the dual-path in exec_before_job. If they break, you'll know something in the tool-only changes affected the legacy path.


Prerequisites

  • CWL_VALIDATED_RUNTIME_PLAN mostly working: tool-only CWL execution via tool request API using runtimeify, CwlDataInternalJson with secondary files, validated_tool_state flowing through exec_before_job
  • exec_before_job dual-path in place (fallback to legacy for workflows)
  • Existing CWL workflow tests passing via the legacy fallback

Step 1: Build CWL Input Dict from Step Connections

Goal: Replace visit_input_values(tool.inputs, execution_state.inputs, callback) for CWL tools with direct dict construction from step connections.

Why: With has_galaxy_inputs=False, CWL tools have no Galaxy-style tool.inputs tree. The visit_input_values callback can't wire inputs. Even if tool.inputs were populated, the callback uses isinstance(input, FieldTypeToolParameter) which depends on the legacy parameter model we're eliminating.

1a: New function: build_cwl_input_dict()

Where: lib/galaxy/workflow/modules.py, new function

def build_cwl_input_dict(
    step: WorkflowStep,
    progress: WorkflowProgress,
    trans,
) -> dict[str, Any]:
    """Build a CWL-native input dict from step connections.

    Returns dict with values as:
    - HDA inputs:  {"src": "hda", "id": N}
    - HDCA inputs: {"src": "hdca", "id": N}
    - Scalars:     raw values (int, str, float, bool, None)
    - expression.json HDAs: parsed JSON content (for non-data connections)
    """
    cwl_input_dict = {}

    for input_name, connections in step.input_connections_by_name.items():
        if len(connections) == 1:
            replacement = progress.replacement_for_connection(connections[0])
        else:
            # Multiple connections -- merge into collection
            replacement = progress.replacement_for_input_connections(
                step, _input_dict_for_name(input_name, step), connections
            )

        if isinstance(replacement, NoReplacement):
            continue

        cwl_input_dict[input_name] = _galaxy_to_cwl_ref(replacement)

    # Fill defaults from step inputs
    for step_input in step.inputs:
        name = step_input.name
        if name not in cwl_input_dict:
            if step_input.default_value is not None:
                cwl_input_dict[name] = _resolve_default(
                    step_input.default_value, trans, progress
                )

    return cwl_input_dict

1b: Conversion helper: _galaxy_to_cwl_ref()

def _galaxy_to_cwl_ref(value):
    """Convert Galaxy model objects to CWL input dict references."""
    if isinstance(value, model.HistoryDatasetAssociation):
        if value.ext == "expression.json":
            # Expression tool outputs: read JSON content as scalar
            with open(value.get_file_name()) as f:
                return json.load(f)
        return {"src": "hda", "id": value.id}
    elif isinstance(value, model.HistoryDatasetCollectionAssociation):
        return {"src": "hdca", "id": value.id}
    elif isinstance(value, model.DatasetCollectionElement):
        return {"src": "dce", "id": value.id}
    else:
        # Scalar value (from parameter_input steps or expression results)
        return value

expression.json handling: CWL ExpressionTools produce expression.json datasets. When a downstream CWL step receives one as a non-data input (scalar parameter), the JSON content should be extracted and passed as the scalar value. This matches the existing behavior in to_cwl() at modules.py:181-183.

Contingency: If expression.json detection causes issues (e.g., a CWL tool genuinely wants the file, not its contents), we may need to check the CWL input type from the tool's parameter model to decide. The CWL parameter model knows whether an input is File type vs scalar.

1c: Wire into ToolModule.execute()

Where: lib/galaxy/workflow/modules.py, ToolModule.execute() around line 2476

# In ToolModule.execute(), before the visit_input_values callback:
if tool.tool_type in CWL_TOOL_TYPES and not tool.has_galaxy_inputs:
    # Approach C: bypass legacy parameter system for CWL
    cwl_input_dict = build_cwl_input_dict(step, progress, trans)
    # ... handle valueFrom expressions (Step 2)
    # ... handle scatter (Step 5)
    # ... build JobInternalToolState and MappingParameters (Step 3)
else:
    # Legacy path for Galaxy tools
    visit_input_values(tool_inputs, execution_state.inputs, callback, ...)
    # ... existing code

Key considerations

  • step.input_connections_by_name maps input names to lists of WorkflowStepConnection objects. Each connection points to an upstream step's output.
  • Multiple connections to the same input indicate merged inputs (collections). Need the same merge logic as replacement_for_input_connections().
  • Step defaults from WorkflowStepInput.default_value need to be applied for unconnected inputs.

Step 2: valueFrom Expression Evaluation with CWL Input Dict

Goal: Evaluate CWL valueFrom JavaScript expressions against the CWL input dict, using to_cwl()/from_cwl() for Galaxy object conversion.

2a: Adapt evaluate_value_from_expressions for CWL dict inputs

The existing evaluate_value_from_expressions() (modules.py:2426-2458) works with execution_state.inputs containing FieldTypeToolParameter-wrapped values. For Approach C, the input dict already has CWL-native references.

def evaluate_cwl_value_from_expressions(
    step: WorkflowStep,
    cwl_input_dict: dict,
    progress: WorkflowProgress,
    trans,
) -> dict[str, Any]:
    """Evaluate CWL valueFrom expressions against a CWL input dict.

    Modifies cwl_input_dict in place with expression results.
    """
    value_from_map = {}
    for step_input in step.inputs:
        if step_input.value_from:
            value_from_map[step_input.name] = step_input.value_from

    if not value_from_map:
        return cwl_input_dict

    # Convert input dict to CWL format for JS evaluation
    hda_references = []
    step_state = {}
    for key, value in cwl_input_dict.items():
        step_state[key] = _ref_to_cwl(value, hda_references, trans, step)

    # Evaluate each valueFrom expression
    for key, value_from in value_from_map.items():
        context = step_state.get(key)
        result = do_eval(value_from, step_state, context=context)
        # Convert CWL result back to input dict reference
        cwl_input_dict[key] = _cwl_result_to_ref(
            result, hda_references, progress, trans
        )

    return cwl_input_dict

2b: Helper: _ref_to_cwl()

Convert an input dict reference ({src: "hda", id: N} or scalar) to CWL format for JavaScript evaluation:

def _ref_to_cwl(value, hda_references, trans, step):
    """Convert input dict reference to CWL format for expression evaluation."""
    if isinstance(value, dict) and "src" in value:
        if value["src"] == "hda":
            hda = trans.sa_session.get(model.HistoryDatasetAssociation, value["id"])
            return to_cwl(hda, hda_references=hda_references, step=step)
        elif value["src"] == "hdca":
            hdca = trans.sa_session.get(model.HistoryDatasetCollectionAssociation, value["id"])
            return to_cwl(hdca, hda_references=hda_references, step=step)
    # Scalars pass through
    return value

2c: Helper: _cwl_result_to_ref()

Convert expression results back to input dict format:

def _cwl_result_to_ref(value, hda_references, progress, trans):
    """Convert CWL expression result back to input dict reference."""
    result = from_cwl(value, hda_references=hda_references, progress=progress)
    if isinstance(result, model.HistoryDatasetAssociation):
        return {"src": "hda", "id": result.id}
    elif isinstance(result, model.HistoryDatasetCollectionAssociation):
        return {"src": "hdca", "id": result.id}
    return result  # scalar

Note: from_cwl() may call raw_to_galaxy() to create deferred HDAs from expression results that produce CWL File objects. This is correct -- the deferred HDA gets an ID, and we store {src: "hda", id: N} in the input dict.

2d: when_expression evaluation

when expressions (step conditional execution) also need adaptation. These are evaluated separately (modules.py:245-298) and return boolean. The existing code converts execution_state.inputs via to_cwl(). For CWL tools, use the same _ref_to_cwl() conversion on the CWL input dict.

Contingency: If the when_expression evaluation is too entangled with the existing code flow, it can be handled with a simple conditional: convert the CWL input dict refs to Galaxy objects for the existing to_cwl() path. This is less clean but works.


Step 3: Build JobInternalToolState and Set validated_param_combinations

Goal: Construct a JobInternalToolState from the CWL input dict and pass it through MappingParameters so job.tool_state gets set.

3a: Create JobInternalToolState

# After building cwl_input_dict and evaluating valueFrom:
internal_tool_state = JobInternalToolState(cwl_input_dict)
internal_tool_state.validate(tool, f"{tool.id} (workflow step)")

Potential issue: validate() calls create_job_internal_model() which uses the tool's parameter model. For CWL tools, this model uses CwlFileParameterModel etc. The validation needs to accept {src: "hda", id: N} dicts for file parameters. This should work if CwlFileParameterModel.py_type returns DataRequest and the pydantic model accepts this format. Verify during implementation.

3b: Build MappingParameters with validated state

# For non-scatter case (single param_combination):
param_combinations = [cwl_input_dict]
validated_param_combinations = [internal_tool_state]

mapping_params = MappingParameters(
    param_template=cwl_input_dict,
    param_combinations=param_combinations,
    validated_param_template=None,  # Not needed for workflow path
    validated_param_combinations=validated_param_combinations,
)

Why this works: execute_single_job() at execute.py:254-256 checks execution_slice.validated_param_combination and, if present, sets job.tool_state = validated_param_combination.input_state. This is the same path the tool request API uses.

3c: Flow into execute()

execute(
    trans=trans,
    tool=tool,
    mapping_params=mapping_params,
    history=progress.effective_history,
    rerun_remap_job_id=None,
    collection_info=collection_info,  # from scatter handling (Step 5)
    workflow_invocation_uuid=progress.workflow_invocation.uuid,
    invocation_step=invocation_step,
    max_num_jobs=max_num_jobs,
    job_callback=job_callback,
    preferred_object_store_id=preferred_object_store_id,
)

Once job.tool_state is set, ToolEvaluator.set_compute_environment() reconstructs the JobInternalToolState, runtimeify converts dataset refs to File objects, and exec_before_job receives the runtimeified state -- the same path as tool-only invocation.


Step 4: Handle Input/Output Dataset Associations

Goal: Ensure Galaxy creates the right input/output dataset associations for the job.

4a: Input dataset registration

The workflow runner's ToolModule.execute() currently passes execution_state.inputs (with Galaxy objects) to tool.handle_single_execution(), which creates JobToInputDatasetAssociation entries. With Approach C, the input dict has {src: "hda", id: N} references, not Galaxy objects.

Two options:

  1. Convert refs back to objects for handle_single_execution: Before calling execute(), build the legacy params dict with actual HDA/HDCA objects that _execute() expects for input dataset association.

  2. Let the tool request API path handle it: If the tool request API path already creates input associations from {src, id} references (via the validated state), the workflow path should match. Verify this.

Contingency: If input dataset association is broken, add a post-processing step that reads job.tool_state and creates JobToInputDatasetAssociation entries for each {src: "hda", id: N} reference.
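The contingency above can be prototyped with a recursive walker over the stored state. This is a hypothetical helper, not existing Galaxy code; it only extracts the reference IDs, and the actual JobToInputDatasetAssociation creation would happen in the caller.

```python
# Sketch: walk job.tool_state and yield every {src: "hda", id: N} reference,
# including references nested inside lists or record-like dicts, so input
# dataset associations can be created from the validated state alone.
def iter_hda_refs(value):
    if isinstance(value, dict):
        if value.get("src") == "hda" and "id" in value:
            yield value["id"]
        else:
            for v in value.values():
                yield from iter_hda_refs(v)
    elif isinstance(value, list):
        for v in value:
            yield from iter_hda_refs(v)
```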

4b: Output dataset creation

Output datasets are created by tool.handle_single_execution() → DefaultToolAction.execute(). This creates output HDAs based on tool.outputs, which is independent of the input format and should work unchanged.


Step 5: Scatter / Collection Mapping

Goal: Make CWL scatter work with the new CWL input dict format.

CWL scatter maps to Galaxy's implicit collection mapping. The workflow runner uses _find_collections_to_match() to identify which inputs have collections that should be iterated over.

5a: Identify scatter inputs

For CWL tools, scatter inputs are marked via step_input.scatter_type (set during workflow import from CWL scatter declarations).

def find_cwl_scatter_collections(
    step: WorkflowStep,
    cwl_input_dict: dict,
    trans,
) -> tuple[CollectionsToMatch, dict]:
    """Identify collections for CWL scatter and build CollectionsToMatch."""
    collections_to_match = CollectionsToMatch()

    for step_input in step.inputs:
        name = step_input.name
        scatter_type = step_input.scatter_type or "dotproduct"

        if scatter_type == "disabled":
            continue
        if name not in cwl_input_dict:
            continue

        ref = cwl_input_dict[name]
        if not isinstance(ref, dict) or ref.get("src") not in ("hdca", "dce"):
            continue

        # Look up the collection
        if ref["src"] == "hdca":
            hdca = trans.sa_session.get(model.HistoryDatasetCollectionAssociation, ref["id"])
            if hdca and hasattr(hdca, "collection") and hdca.collection.allow_implicit_mapping:
                collections_to_match.add(name, hdca)

    return collections_to_match

5b: Expand param_combinations for scatter

When scatter produces multiple execution slices, each slice has different input values (individual elements from the scattered collection). Build param_combinations and validated_param_combinations as lists -- one entry per slice.

if collections_to_match.has_collections():
    matched_collections = dataset_collection_manager.match_collections(collections_to_match)
    collection_info = matched_collections

    param_combinations = []
    validated_param_combinations = []

    for iteration_elements in collection_info.slice_collections():
        slice_dict = dict(cwl_input_dict)  # shallow copy
        for name, element in iteration_elements.items():
            # Replace collection ref with individual element ref
            slice_dict[name] = _galaxy_to_cwl_ref(element.element_object)

        param_combinations.append(slice_dict)
        state = JobInternalToolState(slice_dict)
        state.validate(tool, f"{tool.id} (workflow scatter slice)")
        validated_param_combinations.append(state)
else:
    collection_info = None
    param_combinations = [cwl_input_dict]
    validated_param_combinations = [internal_tool_state]

5c: Contingency: Simplified scatter first

If collection matching proves complex, start with a simpler approach:

  • Support only dotproduct scatter (the only type the current code asserts)
  • Support only single-variable scatter initially
  • Add multi-variable and nested scatter later

The existing test test_scatter_wf1_v1 tests single-variable scatter -- start there.
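The single-variable dotproduct expansion of Step 5b can be modeled without any Galaxy collection machinery. This sketch uses plain lists of per-element refs in place of matched collection slices; expand_scatter_slices is an illustrative name, not a real function.

```python
# Simplified model of scatter slice expansion (dotproduct only): each
# scattered input contributes one element per slice; non-scattered inputs
# are copied into every slice unchanged.
def expand_scatter_slices(cwl_input_dict, scattered):
    """scattered maps input name -> list of per-element refs."""
    names = list(scattered)
    lengths = {len(scattered[n]) for n in names}
    assert len(lengths) == 1, "dotproduct requires equal-length scattered inputs"
    slices = []
    for i in range(lengths.pop()):
        slice_dict = dict(cwl_input_dict)  # shallow copy of the base state
        for name in names:
            slice_dict[name] = scattered[name][i]
        slices.append(slice_dict)
    return slices
```

Each resulting slice dict would then be validated into its own JobInternalToolState, mirroring the per-slice loop in 5b.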


Step 6: Remove the exec_before_job Fallback

Goal: Once all CWL workflow tests pass with Approach C, remove the legacy fallback path.

6a: Remove the dual-path

# Before:
if validated_tool_state is not None:
    input_json = validated_tool_state.input_state
else:
    input_json = self.param_dict_to_cwl_inputs(param_dict, local_working_directory)

# After:
assert validated_tool_state is not None, "CWL tools require validated_tool_state"
input_json = validated_tool_state.input_state

6b: Dead code removal

With the fallback gone, these become dead code:

  • param_dict_to_cwl_inputs() on CWL tool classes
  • to_cwl_job() in representation.py
  • galactic_flavored_to_cwl_job() in representation.py
  • dataset_wrapper_to_file_json() in representation.py
  • TYPE_REPRESENTATIONS dict in representation.py
  • FieldTypeToolParameter in basic.py (if no other callers)
  • CWL-specific Conditional/Repeat handling in representation.py

6c: FieldTypeToolParameter removal

Check for remaining callers:

  • modules.py:2567 -- isinstance(input, FieldTypeToolParameter) in the visit_input_values callback. With Approach C, CWL tools take the new branch; this isinstance check only fires for legacy Galaxy tools (which never use FieldTypeToolParameter). Safe to remove.
  • modules.py:2647 -- expression_callback. Same situation.
  • basic.py:2924 -- from_json(). Only called when deserializing tool state with FieldTypeToolParameter inputs. Dead with new path.

Preserve: raw_to_galaxy() (now in cwl_runtime.py per setup step 5). Still needed by from_cwl() for valueFrom expression results.


Step 7: Clean Up basic.py

Goal: Remove all CWL-specific code from basic.py.

  • Remove FieldTypeToolParameter class (basic.py:2907-2970)
  • Remove raw_to_galaxy() (already moved to cwl_runtime.py)
  • Remove CWL imports and TYPE_REPRESENTATIONS references

Result: basic.py has zero CWL-specific code, and the CWL branch no longer needs to modify basic.py at all.


Testing Strategy

Unit tests (new)

  1. test_build_cwl_input_dict(): Mock step connections and progress outputs. Verify dict has correct {src, id} refs for HDAs, scalar values for parameters, parsed JSON for expression.json HDAs.

  2. test_evaluate_cwl_value_from_expressions(): Given a CWL input dict with HDA refs and a valueFrom expression, verify the expression sees CWL File objects and results are converted back to refs.

  3. test_cwl_scatter_expansion(): Given a CWL input dict with HDCA ref and scatter_type=dotproduct, verify param_combinations has one entry per collection element.

Integration tests (existing)

All tests in test_workflows_cwl.py must pass:

Test                        What it exercises
test_simplest_wf            Single-step workflow, File input/output
test_load_ids               Multi-step subworkflow (search.cwl#main)
test_count_line1_v1         Two-step pipeline (wc -> parseInt ExpressionTool)
test_count_line1_v1_json    Same with JSON job input
test_count_line2_v1         Two-step with different wiring
test_count_lines3_v1        Collection input → scatter → expression.json outputs
test_count_lines4_v1        Multi-input workflow with collection output
test_count_lines4_json      Same with JSON job input
test_scatter_wf1_v1         Explicit CWL scatter

Test gaps to fill

  • valueFrom expressions in workflow steps: No existing test. Add one using a CWL workflow with valueFrom on a step input.
  • Secondary files between workflow steps: A tool produces output with secondaryFiles, next step consumes it. Verify secondary files survive the runtimeify path.
  • when expressions: No existing test for CWL when. Add if CWL branch supports it.
  • Subworkflows: test_load_ids tests import but doesn't fully exercise execution. Add execution test.

Recommended test execution order

  1. Run test_simplest_wf first -- simplest case, most likely to work
  2. Run test_count_line1_v1 -- tests ExpressionTool and multi-step
  3. Run test_scatter_wf1_v1 -- tests scatter
  4. Run remaining tests
  5. Run CWL conformance tests if available

Implementation Order

Step 1: build_cwl_input_dict()
    ├── 1a: New function
    ├── 1b: _galaxy_to_cwl_ref() helper
    └── 1c: Wire into ToolModule.execute()
                    │
                    ▼
Step 2: valueFrom evaluation
    ├── 2a: evaluate_cwl_value_from_expressions()
    ├── 2b: _ref_to_cwl() helper
    ├── 2c: _cwl_result_to_ref() helper
    └── 2d: when_expression adaptation
                    │
                    ▼
Step 3: JobInternalToolState + MappingParameters
    ├── 3a: Create and validate state
    ├── 3b: Build MappingParameters
    └── 3c: Wire into execute()
                    │
                    ▼
Step 4: Input/output associations
    ├── 4a: Input dataset registration
    └── 4b: Output dataset creation
                    │
                    ▼
Step 5: Scatter support
    ├── 5a: Identify scatter inputs
    ├── 5b: Expand param_combinations
    └── 5c: (contingency) simplified scatter first
                    │
                    ▼
Step 6: Remove fallback + dead code
    ├── 6a: Remove dual-path
    ├── 6b: Dead code removal
    └── 6c: FieldTypeToolParameter removal
                    │
                    ▼
Step 7: Clean up basic.py

Steps 1-3 form the minimum viable workflow support. Run test_simplest_wf after Step 3.

Step 2 can be deferred if no existing workflow tests use valueFrom. Check by running tests after Step 3 with a stub that skips valueFrom evaluation.

Step 5 (scatter) is needed for test_scatter_wf1_v1 and test_count_lines3_v1.

Steps 6-7 are cleanup after all tests pass.


Contingencies

If CwlFileParameterModel validation fails

The JobInternalToolState.validate() call may fail if the CWL pydantic parameter model doesn't accept {src: "hda", id: N} dicts for file inputs.

Fix: Check what CwlFileParameterModel.py_type returns. It's DataRequest. The DataRequest type should accept {src, id} format. If not, adjust the pydantic template method to accept this format in the job_internal state representation.

Fallback: Skip validation temporarily (internal_tool_state = JobInternalToolState(cwl_input_dict) without .validate()) and fix the model later.

If expression.json detection is wrong

Some CWL tools might accept expression.json as a File input (not wanting the parsed content). The current _galaxy_to_cwl_ref() always parses expression.json HDAs.

Fix: Consult the CWL tool's parameter model to determine if the input expects File type. If so, return {src: "hda", id: N} even for expression.json datasets. Only parse the JSON for scalar-typed inputs.

If input dataset association breaks

The _execute() function in execute.py creates JobToInputDatasetAssociation entries by walking the params dict. If the CWL input dict format differs from what _execute() expects:

Fix: Add CWL-specific input association creation in execute_single_job() or a callback. Iterate over validated_param_combination.input_state, find {src: "hda", id: N} refs, and create associations.

If scatter with nested collections fails

CWL dotproduct scatter over a list collection is straightforward. Nested collections or multi-variable scatter may hit edge cases in CollectionsToMatch.

Fix: Implement nested scatter incrementally. Start with single-variable flat list scatter. The existing test infrastructure only tests this case anyway.

If to_cwl/from_cwl changes are needed for new format

The existing to_cwl() function works with Galaxy model objects. The new path stores {src, id} refs in the dict, so _ref_to_cwl() needs to look up objects by ID. This requires a database session.

If session access is awkward: Pre-resolve all {src, id} refs to Galaxy objects before calling expression evaluation. Build a lookup dict once, reuse it.


Unresolved Questions

  1. Does CwlFileParameterModel.pydantic_template() return a model that accepts {src: "hda", id: N} for job_internal state representation? If it just returns DataRequest without state-specific templates, validation may fail.

  2. How does _execute() in execute.py create JobToInputDatasetAssociation? Does it walk params looking for HDA objects, or does it use validated_param_combination? If the former, the CWL input dict with {src, id} refs won't have HDA objects to find.

  3. Do any CWL workflow tests exercise step.state.inputs (user-supplied step parameters at invocation time)? If so, Step 1's build_cwl_input_dict() needs to merge those into the dict.

  4. How do CWL subworkflow steps work in ToolModule.execute()? Subworkflows use a different module type (SubWorkflowModule), not ToolModule. Approach C only changes ToolModule.execute(). Subworkflows should be unaffected, but verify.

  5. What's the interaction between MappingParameters.validated_param_template and the workflow path? Tool request API sets both template and combinations. Workflow path may only need combinations. Does execute() require the template to be non-None?

  6. For collection inputs to CWL tools (non-scatter), how does the {src: "hdca", id: N} reference flow through runtimeify? The adapt_collection callback in setup_for_runtimeify() expects DataCollectionRequestInternal. Does the CWL runtimeify path handle this?

Dependency: cwltool

How Galaxy uses cwltool — every import, every API call, and whether those calls make sense for the migration.

Branch: cwl_on_tool_request_api_2


Dependency Overview

cwltool is the CWL reference runner. Galaxy uses it as a library, not a CLI — loading CWL documents, generating commands, staging files, and collecting outputs. Galaxy never runs cwltool as a subprocess.

All cwltool imports go through a single wrapper module (cwltool_deps.py) with try/except guards so cwltool remains an optional dependency. Only parser.py and schema.py call cwltool APIs directly. Everything else uses plain Python dicts that happen to match the CWL spec.
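The guard pattern in cwltool_deps.py can be sketched generically. This is an illustration of the optional-dependency technique, not the module's actual code — `optional_import` and `ensure_available` are invented helper names:

```python
# Sketch of a cwltool_deps-style import gateway: the dependency is probed at
# import time, but missing-dependency errors are raised only at use time.
import importlib


def optional_import(name):
    """Return the named module if importable, else None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def ensure_available(module, name):
    """Fail loudly at use time, not import time, when the dependency is absent."""
    if module is None:
        raise ImportError(f"{name} is not installed; install it to enable CWL support")
    return module
```

Callers import the wrapper module unconditionally and only hit the `ImportError` if they actually exercise a CWL code path.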

Package Dependencies

| Package | Purpose |
| --- | --- |
| cwltool | CWL reference runner — loading, validation, command generation, output collection |
| schema_salad | CWL schema library — URI resolution, YAML loading, source line tracking |
| ruamel.yaml | YAML parsing (transitive via schema_salad) — CommentedMap type |

Import Inventory

cwltool_deps.py — The Single Import Gateway

All cwltool access flows through lib/galaxy/tool_util/cwl/cwltool_deps.py:

| Import | Source | Used By |
| --- | --- | --- |
| main | cwltool | Availability check only |
| pathmapper | cwltool | JobProxy.stage_files() — PathMapper constructor |
| process | cwltool | fill_in_defaults(), stage_files() |
| workflow | cwltool | Type hint for Workflow objects |
| getdefault | cwltool.context | Filesystem access factory default |
| LoadingContext | cwltool.context | CWL document loading configuration |
| RuntimeContext | cwltool.context | Job execution configuration |
| relink_initialworkdir | cwltool.job | InitialWorkDirRequirement handling |
| StdFsAccess | cwltool.stdfsaccess | Filesystem access for fill_in_defaults() |
| load_tool | cwltool | fetch_document(), make_tool() |
| command_line_tool | cwltool | Imported but not directly referenced |
| default_loader | cwltool.load_tool | Raw YAML/JSON document loader |
| resolve_and_validate_document | cwltool.load_tool | CWL document validation |
| Process | cwltool.process | Base class type hint |
| CWLObjectType | cwltool.utils | Type alias for CWL data dicts |
| JobsType | cwltool.utils | Type alias for Job objects |
| normalizeFilesDirs | cwltool.utils | Imported but not referenced in Galaxy code |
| visit_class | cwltool.utils | Recursive CWL object visitor |
| ref_resolver | schema_salad | URI↔path conversion |
| sourceline | schema_salad | add_lc_filename() for in-memory CWL objects |
| yaml_no_ts | schema_salad.utils | YAML loading without timestamp parsing |
| CommentedMap | ruamel.yaml.comments | Type for parsed CWL documents |

Also: beta_relaxed_fmt_check = True (constant) and needs_shell_quoting (regex).

Other Files with Direct cwltool Imports

| File | Import | Purpose |
| --- | --- | --- |
| lib/galaxy_test/base/cwl_location_rewriter.py | LoadingContext, default_loader, pack, visit_field | Test utility for rewriting CWL locations |
| test/unit/tool_util/test_cwl.py | schema_salad (via cwltool_deps) | Accessing ValidationException for test assertions |

API Usage by Galaxy Module

1. schema.py — CWL Document Loading

SchemaLoader wraps cwltool's three-phase loading pipeline:

Phase 1: Fetch

load_tool.fetch_document(uri, loadingContext=loading_context)
# Returns: (LoadingContext, CommentedMap, str)
  • Called in raw_process_reference() with a file:// URI
  • Called in raw_process_reference_for_object() with an in-memory CommentedMap
  • The loading_context is configured with strict, do_validate, enable_dev=True, do_update=True, relax_path_checks=True

Phase 2: Validate

resolve_and_validate_document(loading_context, process_object, uri)
# Returns: (LoadingContext, str)
  • Resolves $import, $include, validates against CWL schema

Phase 3: Instantiate

load_tool.make_tool(uri, loading_context)
# Returns: Process (CommandLineTool | ExpressionTool | Workflow)
  • Returns a concrete cwltool Process object

Two loader instances:

  • schema_loader — strict validation (tool loading)
  • non_strict_non_validating_schema_loader — lenient (job execution, output collection)

Assessment: Clean, correct usage. The three-phase pipeline is cwltool's intended API. Galaxy's two-loader pattern correctly separates validation strictness for loading vs runtime.


2. parser.py — The Core Proxy Layer

This is where nearly all cwltool interaction happens. Three proxy classes wrap cwltool objects.

2.1 ToolProxy — Wraps cwltool.process.Process

Stored reference:

self._tool = tool  # Process instance

Process attributes accessed:

| Attribute | Type | Usage |
| --- | --- | --- |
| .tool | dict | Raw CWL definition — id, inputs, outputs, class, doc, label, requirements, cwlVersion |
| .metadata | dict | metadata["cwlVersion"] for serialization |
| .inputs_record_schema | dict | {"type": "record", "fields": [...]} — input type definitions |
| .outputs_record_schema | dict | Same structure for outputs |
| .schemaDefs | dict | Maps type names to resolved schema definitions |
| .requirements | list | CWL requirements (modified in-place by _hack_cwl_requirements) |
| .hints | list | CWL hints (DockerRequirement moved here) |

Key operations on Process:

  1. Format stripping (__init__, line 145-148): Removes "format" from input field definitions so cwltool won't complain about missing format in input data. This is a workaround — Galaxy doesn't track CWL format URIs on datasets.

  2. Schema definition resolution (input_fields, line 293-294): Looks up input_type in schemaDefs to resolve named types (e.g., SchemaDefRequirement types).

  3. Serialization (to_persistent_representation): Serializes .tool dict + .requirements + .metadata["cwlVersion"] to JSON for database storage.

  4. DockerRequirement hack (_hack_cwl_requirements, line 893-901): Moves DockerRequirement from .requirements to .hints so cwltool doesn't try to run containers — Galaxy handles containerization independently.

Assessment: All attribute access is on well-established cwltool Process properties. The format stripping and Docker hack are legitimate bridging concerns. The serialization approach (persisting .tool dict) works because cwltool can reconstruct a Process from a raw CWL dict via fetch_document() + make_tool().

2.2 JobProxy — Wraps cwltool.job.Job (lazily)

This is the heaviest cwltool integration point.

_normalize_job() — Preparing inputs for cwltool (line 376-391):

runtime_context = RuntimeContext({})
make_fs_access = getdefault(runtime_context.make_fs_access, StdFsAccess)
fs_access = make_fs_access(runtime_context.basedir)
process.fill_in_defaults(self._tool_proxy._tool.tool["inputs"], self._input_dict, fs_access)
visit_class(self._input_dict, ("File", "Directory"), pathToLoc)

Calls:

  1. RuntimeContext({}) — empty context just to get filesystem access factory
  2. getdefault(runtime_context.make_fs_access, StdFsAccess) — get fs_access class with StdFsAccess as default
  3. process.fill_in_defaults(inputs_list, input_dict, fs_access) — fills default values into _input_dict in-place
  4. visit_class(obj, class_names, callback) — converts "path" keys to "location" in File/Directory objects

Assessment: Correct API usage. fill_in_defaults is cwltool's standard way to apply CWL default values. visit_class is the standard recursive visitor. The pathToLoc conversion is necessary because Galaxy internally uses path but cwltool expects location.
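The visit_class/pathToLoc step can be illustrated in isolation. The sketch below reimplements the visitor semantics in plain Python for clarity — it is not cwltool's actual implementation, just a behavioral mirror:

```python
# Illustration of cwltool's visit_class semantics: recursively walk a CWL
# value, applying a callback to every dict whose "class" is in class_names.

def visit_class(obj, class_names, op):
    if isinstance(obj, dict):
        if obj.get("class") in class_names:
            op(obj)
        for value in obj.values():
            visit_class(value, class_names, op)
    elif isinstance(obj, list):
        for item in obj:
            visit_class(item, class_names, op)


def path_to_loc(file_dict):
    """Mirror of Galaxy's pathToLoc callback: rename "path" to "location" in place."""
    if "location" not in file_dict and "path" in file_dict:
        file_dict["location"] = file_dict.pop("path")
```

Applied to a job dict, every nested File/Directory gains a `location` key in place of `path`, which is what cwltool expects downstream.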

_ensure_cwl_job_initialized() — Creating the cwltool Job (line 354-374):

runtimeContext = RuntimeContext({
    "basedir": job_directory,
    "select_resources": self._select_resources,
    "outdir": os.path.join(job_directory, "working"),
    "tmpdir": os.path.join(job_directory, "cwltmp"),
    "stagedir": os.path.join(job_directory, "cwlstagedir"),
    "use_container": False,
    "beta_relaxed_fmt_check": beta_relaxed_fmt_check,
})
cwl_tool_instance = copy.copy(self._tool_proxy._tool)
cwl_tool_instance.inputs_record_schema = copy.deepcopy(cwl_tool_instance.inputs_record_schema)
self._cwl_job = next(cwl_tool_instance.job(self._input_dict, self._output_callback, runtimeContext))

Key design decisions:

  • use_container=False — Galaxy wraps the command in its own container layer later
  • select_resources=self._select_resources — callback that injects SENTINEL_GALAXY_SLOTS_VALUE (1.480231396) for cores since the real slot count isn't known at job-preparation time
  • Shallow copy of Process + deep copy of inputs_record_schema — because Process.job() mutates inputs_record_schema in place (not thread-safe)
  • next() on the generator — cwltool's job() returns a generator, Galaxy only needs the first (and only) job

Assessment: The copy pattern is a legitimate workaround for cwltool's in-place mutation. The GALAXY_SLOTS sentinel hack is fragile but necessary — cwltool needs a concrete number for ResourceRequirement evaluation at job-construction time. The use_container=False is correct — Galaxy's container system handles Docker/Singularity.

Job object properties accessed (on the object returned by Process.job()):

| Property | Type | CommandLineTool | ExpressionTool |
| --- | --- | --- | --- |
| command_line | list[str] | Command fragments | N/A (checked via hasattr) |
| stdin | str \| None | Stdin redirect path | N/A |
| stdout | str \| None | Stdout redirect path | N/A |
| stderr | str \| None | Stderr redirect path | N/A |
| environment | dict | EnvVarRequirement vars | N/A |
| generatefiles | dict | {"listing": [...]} for InitialWorkDirRequirement | N/A |
| pathmapper | PathMapper | Input file path mapping | N/A (checked via hasattr) |
| inplace_update | bool | InlineJavascriptRequirement flag | N/A |

Assessment: These are all standard cwltool Job properties. Galaxy distinguishes CommandLineTool from ExpressionTool via hasattr(cwl_job, "command_line") which is the correct check — ExpressionTool jobs don't have a command_line attribute.
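The hasattr dispatch is trivial to demonstrate with stand-in classes. These classes are illustrative only — they are not cwltool types:

```python
# Sketch of the hasattr-based job-type dispatch, using stand-in classes.

class FakeCommandLineJob:
    command_line = ["echo", "hello"]  # CommandLineTool jobs expose this

class FakeExpressionJob:
    pass  # ExpressionTool jobs have no command_line attribute


def is_command_line_tool_job(cwl_job):
    """Galaxy-style check distinguishing CommandLineTool from ExpressionTool jobs."""
    return hasattr(cwl_job, "command_line")
```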

Job methods called:

| Method | When | Signature |
| --- | --- | --- |
| collect_outputs(outdir, rcode) | CommandLineTool post-execution | Returns dict[str, CWLObjectType] — output name to CWL value mapping |
| run(RuntimeContext({})) | ExpressionTool execution | Executes JS expression, calls _output_callback(out, status) |

Assessment: Correct usage. For CommandLineTools, collect_outputs evaluates output glob patterns against the working directory. For ExpressionTools, run() executes the JavaScript and delivers results via callback. The empty RuntimeContext for expression execution is fine — expressions don't need runtime configuration.

_select_resources() callback (line 433-440):

def _select_resources(self, request, runtime_context=None):
    new_request = request.copy()
    new_request["cores"] = SENTINEL_GALAXY_SLOTS_VALUE
    return new_request

cwltool calls this during job construction to resolve ResourceRequirement. Galaxy substitutes a sentinel float for cores, then replaces it back with $GALAXY_SLOTS in the command_line property:

return [fragment.replace(str(SENTINEL_GALAXY_SLOTS_VALUE), "$GALAXY_SLOTS") for fragment in command_line]

Assessment: This is a hack but functional. The sentinel value (1.480231396) is unlikely to appear naturally. The real fix would be deferring job construction to the compute node where slot count is known, but that would require architectural changes.
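The full sentinel round trip can be sketched as two pure functions — the callback cwltool invokes, and the reverse substitution Galaxy applies to the resulting command line. The sentinel constant matches the value cited above; the function names are illustrative:

```python
# Round trip of the GALAXY_SLOTS sentinel: inject a distinctive float when
# cwltool resolves ResourceRequirement, then swap it back for $GALAXY_SLOTS
# in the emitted command fragments.

SENTINEL_GALAXY_SLOTS_VALUE = 1.480231396


def select_resources(request, runtime_context=None):
    """cwltool callback: substitute the sentinel for the unknown core count."""
    new_request = dict(request)
    new_request["cores"] = SENTINEL_GALAXY_SLOTS_VALUE
    return new_request


def unsentinel(command_line):
    """Replace the sentinel with $GALAXY_SLOTS in each command fragment."""
    return [
        fragment.replace(str(SENTINEL_GALAXY_SLOTS_VALUE), "$GALAXY_SLOTS")
        for fragment in command_line
    ]
```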

_output_callback(out, process_status) (line 486-493):

cwltool's callback contract: (Optional[CWLObjectType], str). Galaxy stores the output and status, checking for "success".

Assessment: Correct callback implementation matching cwltool's expected signature.
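The callback contract is small enough to mirror directly. A minimal sketch of a collector honoring the `(Optional[CWLObjectType], str)` signature:

```python
# Minimal mirror of cwltool's output callback contract: store the output
# object and record whether the process status was "success".

class OutputCollector:
    def __init__(self):
        self.output = None
        self.ok = False

    def __call__(self, out, process_status):
        self.output = out
        self.ok = process_status == "success"
```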

stage_files() (line 541-564):

# Input files via pathmapper
process.stage_files(cwl_job.pathmapper, stageFunc, ignore_writable=True, symlink=False)

# InitialWorkDirRequirement files
generate_mapper = pathmapper.PathMapper(
    cwl_job.generatefiles["listing"], outdir, outdir, separateDirs=False
)
process.stage_files(generate_mapper, stageFunc, ignore_writable=inplace_update, symlink=False)
relink_initialworkdir(generate_mapper, outdir, outdir, inplace_update=inplace_update)

Galaxy's stageFunc creates symlinks (os.symlink). Two passes:

  1. Input files — symlinked from Galaxy dataset paths to cwltool staging
  2. Generated files (InitialWorkDirRequirement) — staged into working directory, then relinked

Assessment: This follows cwltool's own staging pattern. The separateDirs=False is correct for Galaxy's flat working directory. The ignore_writable=True for inputs prevents cwltool from trying to copy writable files. The symlink=False parameter to process.stage_files means Galaxy's custom stageFunc handles linking, not cwltool's default.
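A symlink-based stageFunc of the kind described can be sketched as follows. The `(resolved, target)` argument order reflects how staging callbacks receive pathmapper entries; treat the exact signature as an assumption rather than a guarantee:

```python
# Sketch of a symlink-based staging function: link a resolved source path
# into the staging target, creating parent directories as needed.
import os
import tempfile


def stage_func(resolved, target):
    """Symlink `resolved` to `target`, skipping targets that already exist."""
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if not os.path.exists(target):
        os.symlink(resolved, target)
```

Because inputs are symlinked rather than copied, staging is cheap even for large datasets, and `ignore_writable=True` keeps cwltool from demanding copies.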

rewrite_inputs_for_staging() (line 393-431):

This method has a commented-out block for pathmapper-based rewriting, with an active fallback that manually symlinks files whose location doesn't match their basename. This is a workaround for files that cwltool's staging doesn't handle (e.g., expression tools without a pathmapper).

Assessment: The commented-out code suggests this is incomplete. The active fallback is functional but inelegant — it manually traverses the input dict looking for location/basename mismatches.

save_job() / load_job_proxy() — Persistence for output collection (line 508-516, 798-808):

# save_job writes:
{"tool_representation": proxy.to_persistent_representation(), "job_inputs": input_dict, "output_dict": output_dict}

# load_job_proxy reads it back:
cwl_tool = tool_proxy_from_persistent_representation(persisted_tool)
return cwl_tool.job_proxy(job_inputs, output_dict, job_directory)

This serializes the full CWL context to .cwl_job.json so the output collection script can reconstruct a JobProxy post-execution.

Assessment: This round-trip works because to_persistent_representation() captures the raw CWL tool dict, and from_persistent_representation() feeds it back through cwltool's loading pipeline (with strict_cwl_validation=False for speed). The non-strict loader avoids re-validating at output collection time.
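The persistence round trip is ordinary JSON serialization. A minimal sketch, with `save_job`/`load_job_state` as illustrative names and the tool representation passed in as an opaque dict (the real code obtains it from `to_persistent_representation()`):

```python
# Sketch of the .cwl_job.json round trip: persist tool representation plus
# job inputs/outputs so a post-execution script can rebuild the job context.
import json
import os
import tempfile

STATE_FILENAME = ".cwl_job.json"


def save_job(job_directory, tool_representation, job_inputs, output_dict):
    state = {
        "tool_representation": tool_representation,
        "job_inputs": job_inputs,
        "output_dict": output_dict,
    }
    with open(os.path.join(job_directory, STATE_FILENAME), "w") as f:
        json.dump(state, f)


def load_job_state(job_directory):
    with open(os.path.join(job_directory, STATE_FILENAME)) as f:
        return json.load(f)
```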

2.3 WorkflowProxy — Wraps cwltool.workflow.Workflow

Stored reference:

self._workflow = workflow  # cwltool.workflow.Workflow

Workflow attributes accessed:

| Attribute | Type | Usage |
| --- | --- | --- |
| .tool | dict | Raw CWL workflow dict — id, inputs, outputs |
| .steps | iterable | WorkflowStep objects |

WorkflowStep attributes accessed:

| Attribute | Type | Usage |
| --- | --- | --- |
| .tool | dict | Step definition — run, class, inputs, scatter, scatterMethod, when |
| .id | str | Step identifier |
| .requirements | list | Step-level requirements |
| .hints | list | Step-level hints |
| .embedded_tool | Process \| Workflow | The nested Process or sub-Workflow for inline tools |

Assessment: All access is on standard cwltool Workflow/WorkflowStep properties. Galaxy reads these to convert CWL workflows into Galaxy's internal workflow format. No mutation of cwltool objects here.


3. runtime_actions.py — Output Collection

Direct cwltool calls: Only ref_resolver.uri_file_path(location) for converting file:// URIs to filesystem paths.

Indirect cwltool calls: job_proxy.collect_outputs() which delegates to cwltool's Job.collect_outputs() or Job.run().

The rest is pure Galaxy logic — moving files, handling secondary files, writing galaxy.json metadata.

Assessment: Clean separation. The only cwltool dependency is through JobProxy (which is correct) and the URI resolver (which is a simple utility).


4. representation.py — Zero cwltool API Calls

Despite being the CWL↔Galaxy translation layer, this module never calls cwltool directly. It constructs plain Python dicts matching the CWL spec ({"class": "File", "location": ...}). The only cwltool-adjacent reference is _cwl_tool_proxy.input_fields() which is Galaxy's ToolProxy method.

Assessment: This is actually a good thing. The representation layer's problems are conceptual (the round-trip hack), not dependency-related. Eliminating it doesn't remove any cwltool API usage.


5. util.py — Zero cwltool API Calls

galactic_job_json(), output_to_cwl_json(), and the upload target classes all work with plain CWL-spec dicts. No cwltool dependency.


6. parser/cwl.py (CwlToolSource) — Indirect Only

All cwltool access goes through ToolProxy methods. The one exception is self.tool_proxy._tool.tool.get("successCodes") which reads the raw CWL dict for exit code parsing.


cwltool Object Lifecycle

                    LOADING
                    =======
SchemaLoader.raw_process_reference(path)
    → load_tool.fetch_document(uri, loading_context)
    → RawProcessReference(loading_context, process_object, uri)

SchemaLoader.process_definition(raw_ref)
    → resolve_and_validate_document(loading_context, process_object, uri)
    → ResolvedProcessDefinition(loading_context, uri)

SchemaLoader.tool(process_definition)
    → load_tool.make_tool(uri, loading_context)
    → Process (CommandLineTool | ExpressionTool | Workflow)

                    WRAPPING
                    ========
_cwl_tool_object_to_proxy(Process)
    → CommandLineToolProxy(Process) | ExpressionToolProxy(Process)

_hack_cwl_requirements(Process)
    → Moves DockerRequirement from .requirements to .hints

                    JOB CREATION
                    ============
ToolProxy.job_proxy(input_dict, output_dict, job_directory)
    → JobProxy.__init__()
        → _normalize_job()
            → process.fill_in_defaults(tool.inputs, input_dict, fs_access)
            → visit_class(input_dict, ("File","Directory"), pathToLoc)
        → (lazy) _ensure_cwl_job_initialized()
            → Process.job(input_dict, output_callback, runtime_context) → Job

                    COMMAND EXTRACTION
                    ==================
JobProxy.command_line    → Job.command_line (with GALAXY_SLOTS unsentineling)
JobProxy.stdin/stdout/stderr → Job.stdin/stdout/stderr
JobProxy.environment     → Job.environment

                    FILE STAGING
                    ============
JobProxy.stage_files()
    → process.stage_files(Job.pathmapper, stageFunc)
    → pathmapper.PathMapper(Job.generatefiles["listing"], ...)
    → process.stage_files(generate_mapper, stageFunc)
    → relink_initialworkdir(generate_mapper, ...)

                    PERSISTENCE
                    ===========
JobProxy.save_job()
    → Writes .cwl_job.json (tool repr + inputs + outputs)

                    OUTPUT COLLECTION
                    =================
load_job_proxy(job_directory)
    → Reads .cwl_job.json
    → tool_proxy_from_persistent_representation()
    → ToolProxy.job_proxy() → new JobProxy

JobProxy.collect_outputs(working_dir, exit_code)
    → CommandLineTool: Job.collect_outputs(outdir, rcode)
    → ExpressionTool: Job.run(RuntimeContext({})) → output via callback

Cross-Reference: Do the Calls Make Sense?

Correct and Clean

| Call | Verdict |
| --- | --- |
| load_tool.fetch_document() / resolve_and_validate_document() / make_tool() | Standard three-phase loading pipeline. Correct. |
| process.fill_in_defaults() | Standard API for applying CWL defaults. Correct. |
| visit_class() | Standard recursive visitor. Correct. |
| Process.job() + next() | Standard way to create a cwltool Job. Correct. |
| Job.collect_outputs() | Standard output collection for CommandLineTools. Correct. |
| Job.run() for ExpressionTools | Standard expression execution. Correct. |
| process.stage_files() + PathMapper | Standard file staging. Correct. |
| relink_initialworkdir() | Standard InitialWorkDirRequirement handling. Correct. |
| ref_resolver.uri_file_path() | Simple URI conversion. Correct. |
| use_container=False | Galaxy handles containers. Correct. |
| strict_cwl_validation=False at runtime | Avoid re-validating at execution time. Correct. |

Workarounds / Hacks

| Call | Issue | Assessment |
| --- | --- | --- |
| _hack_cwl_requirements() — move DockerRequirement to hints | cwltool would try to run Docker if it's a requirement | Necessary hack. Galaxy's container system is authoritative. |
| Format stripping from inputs_record_schema | cwltool validates input formats, Galaxy doesn't track CWL format URIs | Necessary hack. Without this, cwltool rejects inputs missing format. Could be solved by tracking format URIs on datasets. |
| SENTINEL_GALAXY_SLOTS_VALUE in _select_resources | Galaxy doesn't know slot count at job-preparation time | Fragile hack. Works in practice because the sentinel is unlikely to collide. Better solution would defer Job construction to the compute node. |
| Shallow copy of Process + deep copy of inputs_record_schema | Process.job() mutates inputs_record_schema in place | Necessary workaround for cwltool's lack of thread safety. |
| rewrite_inputs_for_staging() fallback | Manually symlinks when pathmapper isn't available (ExpressionTools) | Incomplete — the pathmapper-based path is commented out. Works for simple cases. |

Unused / Potentially Dead

| Import | Status |
| --- | --- |
| normalizeFilesDirs | Imported in cwltool_deps.py, exported in __all__, but grep finds no usage in Galaxy code |
| command_line_tool | Imported but never referenced directly |

Implications for the Migration

What Changes

The migration to validated tool state changes how input data reaches JobProxy, not how JobProxy uses cwltool. Currently:

  • Legacy: Galaxy param_dict → to_cwl_job() → CWL input dict → JobProxy
  • New path: validated_tool_state.input_state → (needs runtimeify) → CWL input dict → JobProxy

JobProxy's cwltool API usage stays the same either way. The calls to fill_in_defaults, Process.job(), stage_files, and collect_outputs are unchanged.

What Stays

All cwltool API calls in parser.py are correct for the migration. The proxy layer is well-designed — it isolates cwltool behind clean interfaces. The three areas that remain:

  1. Document loading (schema.py) — Unchanged by migration
  2. Job execution (JobProxy) — Input format changes, API calls stay the same
  3. Output collection (runtime_actions.py via JobProxy) — Unchanged by migration

What the Runtimeify Step Must Produce

For _normalize_job() and subsequently Process.job() to work, the input dict must contain:

  • File inputs as {"class": "File", "path": "/absolute/path", ...} or {"class": "File", "location": "file:///path", ...} (the pathToLoc callback in _normalize_job converts path to location)
  • Directory inputs similarly
  • Scalar values as plain Python types
  • Optional null inputs omitted or set to None
  • fill_in_defaults will fill in any missing inputs that have CWL defaults

This is what the CWL-specific runtimeify must produce from JobInternalToolState (which has {src: "hda", id: N} references).
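A minimal sketch of that runtimeify step, assuming a hypothetical `path_for` resolver (the real implementation would consult Galaxy's object store for the dataset's file path):

```python
# Sketch of a CWL-specific runtimeify: turn {src: "hda", id: N} references
# from JobInternalToolState into CWL File dicts; scalars pass through.
# `path_for` is an invented resolver for illustration, not a Galaxy API.
import os


def runtimeify(input_state, path_for):
    runtime = {}
    for name, value in input_state.items():
        if isinstance(value, dict) and value.get("src") == "hda":
            path = path_for(value["id"])
            runtime[name] = {
                "class": "File",
                "path": path,  # pathToLoc later converts this to "location"
                "basename": os.path.basename(path),
            }
        else:
            runtime[name] = value  # scalars and None pass through unchanged
    return runtime
```

Defaults need not be filled here — `fill_in_defaults` handles any input the validated state omits.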

Potential Simplification

normalizeFilesDirs is imported but unused — could be cleaned up. The command_line_tool import is also dead. Neither affects functionality.
