Galaxy's CWL integration (branch cwl-1.0 at common-workflow-lab/galaxy) translates CWL's schema-based parameter model into Galaxy's opinionated tool parameter model (basic.py). This round-trip — CWL schema to Galaxy widgets to user input to Galaxy param_dict to CWL job JSON — is the root of the problem. It requires:
Every CWL type mapped to a Galaxy widget type (TYPE_REPRESENTATIONS)
Union types encoded as Galaxy conditionals with synthetic _cwl__type_/_cwl__value_ keys
A catch-all FieldTypeToolParameter added to basic.py
Reverse-engineering Galaxy DatasetWrappers back into CWL File objects (to_cwl_job(), galactic_flavored_to_cwl_job() in representation.py)
This adaptation layer is hack after hack and deeply entangles CWL with Galaxy's core parameter infrastructure.
Target Pattern: YAML Tool Runtime
Galaxy's YAML-defined tools already demonstrate the architecture we want for CWL. They use a typed state transformation chain:
Each transition is schema-validated and unit-testable. The runtimeify() step converts dataset references ({src: "hda", id: N}) into CWL-style File objects with real paths — all driven by the tool's parameter model, not by reverse-engineering Galaxy wrappers.
YAML tools bypass Galaxy's legacy parameter parsing entirely: they set has_galaxy_inputs = False, use UserToolEvaluator, and build commands from validated state via JavaScript/CWL expressions.
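As a concrete sketch of that chain (assuming the setup_for_runtimeify() and runtimeify() signatures cited later in this document; the wrapper function itself is hypothetical):

```python
# Sketch: JobInternalToolState ({src: "hda", id: N} refs) -> JobRuntimeToolState
# (CWL-style File objects with real paths), driven by the tool's parameter model.
from galaxy.tool_util.parameters.convert import runtimeify
from galaxy.tools.runtime import setup_for_runtimeify


def to_runtime_state(app, compute_environment, tool, internal_tool_state, inp_data):
    hda_references, adapt_dataset, adapt_collection = setup_for_runtimeify(
        app, compute_environment, inp_data, None  # no input collections in this sketch
    )
    return runtimeify(internal_tool_state, tool, adapt_dataset, adapt_collection)
```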
Goal
Migrate CWL tool execution to follow this same pattern:
Tool request API entry: CWL tools accept requests via POST /api/jobs with CWL-native parameters. Galaxy provides a client layer translating dataset references (not raw file paths) into the API.
Validated tool state: The input flows through the typed state chain. JobInternalToolState is persisted on the job with dataset references.
CWL-specific runtimeify: At evaluation time, convert JobInternalToolState into CWL job inputs (File objects with paths, secondary files, format URIs) — analogous to what YAML tools do but richer to satisfy cwltool's expectations.
cwltool via JobProxy: Pass the runtimeified inputs to JobProxy, which delegates command building, file staging, and output collection to cwltool. This delegation is inherent to CWL — cwltool is the authoritative command builder — and differs structurally from YAML tools (which evaluate expressions directly).
What This Eliminates
The representation.py layer (to_cwl_job, galactic_flavored_to_cwl_job, dataset_wrapper_to_file_json)
CWL-specific parameter types in basic.py (FieldTypeToolParameter, TYPE_REPRESENTATIONS)
Galaxy's parameter expansion/population machinery for CWL tools (expand_meta_parameters, _populate_async)
The to_cwl() fallback path in workflow/modules.py for CWL tool execution
Ideally, the CWL branch would not modify basic.py at all.
What Remains Different from YAML Tools
CWL tools will not be identical to YAML tools. Key structural differences:
Command pre-computation: CWL delegates command building to cwltool at exec_before_job time via JobProxy. YAML tools evaluate expressions at _build_command_line time.
Output collection: CWL uses a post-execution relocate_dynamic_outputs.py script because cwltool runs its own output glob evaluation. YAML tools use Galaxy's standard metadata.
File staging: CWL uses cwltool's PathMapper for symlinks (needed for InitialWorkDirRequirement). YAML tools use Galaxy's compute_environment.input_path_rewrite().
Evaluator class: CWL uses ToolEvaluator (not UserToolEvaluator) since commands are pre-computed, not built from expressions.
Open Questions (with preliminary research — not yet reviewed)
1. Where should dataset→File conversion happen for CWL?
Answer: Both phases — generic runtimeify then CWL-specific enrichment.
Currently exec_before_job receives validated_tool_state.input_state containing raw {src: "hda", id: N} references and passes them directly to JobProxy. But JobProxy._normalize_job() (parser.py:376-391) expects File objects with path or location keys — a structural mismatch.
The two-phase approach:
Generic runtimeify() (convert.py:539-577) with setup_for_runtimeify() (runtime.py:50-123) converts raw dataset refs into DataInternalJson File objects with class, path, basename, format, size, etc. This is the same typed conversion YAML tools use — reusable, unit-testable.
CWL-specific enrichment in exec_before_job adds what cwltool needs beyond basic File objects: secondaryFiles, CWL format URIs, checksums, and anything else JobProxy._normalize_job() expects.
The generic phase can be called at exec_before_job time since inp_data is available there. This follows the YAML pattern while keeping CWL-specific concerns isolated.
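A minimal sketch of the second, CWL-specific phase, assuming the generic runtimeify() has already produced File dicts (enrich_file is a hypothetical callback standing in for attaching secondaryFiles, format URIs, checksums, etc.):

```python
def enrich_cwl_files(node, enrich_file):
    """Recursively visit CWL File dicts produced by runtimeify and enrich them in place."""
    if isinstance(node, dict):
        if node.get("class") == "File":
            enrich_file(node)
        for value in node.values():
            enrich_cwl_files(value, enrich_file)
    elif isinstance(node, list):
        for item in node:
            enrich_cwl_files(item, enrich_file)
```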
2. How to handle secondary files in the new path?
Answer: Extend DataInternalJson and discover secondary files in adapt_dataset().
The current DataInternalJson (tool_util_models/parameters.py:594-612) has no secondaryFiles field — it's explicitly commented out. This is the key blocker.
Secondary files are stored in Galaxy's object store at {dataset.extra_files_path}/__secondary_files__/ with an ordering index at __secondary_files_index.json (util.py:41). The legacy dataset_wrapper_to_file_json() (representation.py:163-183) discovers them at execution time by enumerating this directory.
The fix:
Add secondaryFiles: Optional[List["DataInternalJson"]] to DataInternalJson
Enhance adapt_dataset() in setup_for_runtimeify() (runtime.py:77-94) to check the HDA's extra_files_path for __secondary_files__/, read the index, and attach them to the File object
This keeps secondary file logic centralized in the same callback that builds the primary File object, avoids duplication between YAML and CWL paths, and integrates naturally with cwltool's expectations (it processes secondaryFiles via visit_class() in _normalize_job()).
3. Can output collection move inside Galaxy's job finishing?
Answer: No — it must stay as an appended script. Several reasons:
cwltool needs the working directory: job_proxy.collect_outputs() (parser.py:495-506) invokes cwltool's native collect_outputs() which evaluates output glob patterns at runtime, directly on the filesystem where outputs were created. Galaxy's discover_outputs() (tools/__init__.py:2871-2916) uses pre-computed metadata — fundamentally different.
Compute node locality: The .cwl_job.json file and working directory live on the compute node. Moving collection to JobWrapper.finish() on the controller would require transferring both, plus having cwltool available on the controller.
CWL-specific metadata model: handle_outputs() (runtime_actions.py:69-230) handles secondaryFiles with custom index files, Directory outputs, CWL format URIs, and ExpressionTool JS execution — none of which Galaxy's discover_outputs() understands.
Container execution: The relocate script runs after the container exits but within the same job script, with working directory still accessible. This mirrors how Galaxy's standard metadata handling works outside containers (command_factory.py:163-165).
The appended script is the correct architectural choice. CWL output collection is fundamentally a cwltool responsibility requiring working directory context.
Plumb CWL tool execution — both direct invocation and workflow steps — through Galaxy's validated tool state chain and runtimeify infrastructure. Eliminate the representation.py adaptation layer, FieldTypeToolParameter, and all CWL-specific code in basic.py.
The runtimeify visitor checks isinstance(parameter, DataParameterModel). CWL file parameters use CwlFileParameterModel / CwlDirectoryParameterModel which extend BaseGalaxyToolParameterModelDefinition, not DataParameterModel.
Fix: Extend the visitor's to_runtime_callback in convert.py to also check for CWL file/directory parameter types, or make those types inherit from DataParameterModel.
1e: Move raw_to_galaxy() to cwl_runtime.py
raw_to_galaxy() creates deferred HDAs from CWL File dicts. Currently in basic.py. Move it to cwl_runtime.py so basic.py cleanup is unblocked. The workflow path needs it for from_cwl() valueFrom expression results.
Step 2: Tool-Only Runtimeify
Wire runtimeify into ToolEvaluator.set_compute_environment() for CWL tools invoked via the tool request API.
2a: Call runtimeify in the evaluator
Where: evaluation.py, ToolEvaluator.set_compute_environment(), in the param_dict_style == "regular" branch between state reconstruction and execute_tool_hooks.
Widen the type signature: execute_tool_hooks() and exec_before_job() accept Optional[Union[JobInternalToolState, JobRuntimeToolState]]. Both expose .input_state.
2c: _normalize_job() is unchanged
JobProxy._normalize_job() still runs fill_in_defaults() (CWL defaults) and pathToLoc (convert path→location). Both are no-ops or safe on pre-runtimeified inputs. No changes needed.
Step 3: Workflow CWL Input Dict Construction
Build CWL input dicts directly from step connections, bypassing Galaxy's legacy parameter system (FieldTypeToolParameter, visit_input_values callback, etc.).
3a: build_cwl_input_dict()
Where: lib/galaxy/workflow/modules.py, new function
expression.json handling: ExpressionTools produce expression.json datasets. When downstream steps consume these as scalars, parse the JSON content. If the downstream input is actually File-typed, we may need to check the CWL parameter model — fix if/when a test breaks.
Start with single-variable dotproduct scatter. Multi-variable and nested scatter can come later — existing tests only cover single-variable.
Step 6: Input/Output Dataset Associations
6a: Input dataset registration
The _execute() function creates JobToInputDatasetAssociation entries by walking the params dict looking for HDA objects. With the new path, params has {src: "hda", id: N} refs, not objects.
Fix: Add CWL-specific input association creation. After job creation, iterate over job.tool_state, find {src: "hda", id: N} refs, and create associations. Or pre-resolve refs to objects before calling execute.
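A sketch of the first option (walking the persisted state for refs); collect_hda_refs() is hypothetical, and the caller still needs to resolve each id to an HDA and create the JobToInputDatasetAssociation:

```python
def collect_hda_refs(state, prefix=""):
    """Yield (parameter_path, hda_id) for every {"src": "hda", "id": N} ref in job.tool_state."""
    if isinstance(state, dict):
        if state.get("src") == "hda" and "id" in state:
            yield prefix or "input", state["id"]
        else:
            for key, value in state.items():
                yield from collect_hda_refs(value, f"{prefix}|{key}" if prefix else key)
    elif isinstance(state, list):
        for index, value in enumerate(state):
            yield from collect_hda_refs(value, f"{prefix}_{index}")
```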
6b: Output dataset creation
Created by DefaultToolAction.execute() based on tool.outputs. Independent of input format — should work unchanged.
Steps 1-2 get tool-only CWL working through the new path.
Steps 3-5 get workflow CWL working.
Step 6 handles any association issues that surface during testing.
Step 7 is cleanup once everything works.
Steps 4-5 can be deferred if simple workflow tests pass without valueFrom/scatter. Just stub them out and add when tests need them.
New Files
| File | Contents |
| --- | --- |
| lib/galaxy/tools/cwl_runtime.py | setup_for_cwl_runtimeify(), discover_secondary_files(), raw_to_galaxy() (moved from basic.py) |
Modified Files
| File | Changes |
| --- | --- |
| lib/galaxy/tool_util_models/parameters.py | Add CwlSecondaryFileJson, CwlDataInternalJson |
| lib/galaxy/tool_util/parameters/convert.py | Extend to_runtime_callback for CWL parameter types |
| lib/galaxy/tools/evaluation.py | Add runtimeify call in set_compute_environment() |
| lib/galaxy/tools/__init__.py | Widen exec_before_job type signature; eventually remove fallback |
| lib/galaxy/workflow/modules.py | Add build_cwl_input_dict(), CWL branch in ToolModule.execute(), expression helpers |
discover_secondary_files(): Mock HDA with extra_files_path/__secondary_files__/. Verify File/Directory classification and path resolution.
CWL adapt_dataset: Given HDA with secondary files and CWL format, verify CwlDataInternalJson output.
runtimeify with CWL parameters: CWL parameter model + JobInternalToolState with HDA refs → verify File objects.
build_cwl_input_dict(): Mock step connections and progress. Verify {src, id} refs for HDAs, scalars for params, parsed JSON for expression.json.
evaluate_cwl_value_from_expressions(): Input dict with HDA refs + valueFrom → verify expression sees CWL Files and results convert back.
Scatter expansion: Input dict with HDCA ref + scatter → verify one param combination per element.
Integration Tests
All tests in test_workflows_cwl.py:
| Test | Exercises |
| --- | --- |
| test_simplest_wf | Single-step, File I/O |
| test_load_ids | Multi-step subworkflow |
| test_count_line1_v1 | Two-step + ExpressionTool |
| test_count_line1_v1_json | JSON job input |
| test_count_line2_v1 | Different wiring |
| test_count_lines3_v1 | Collection input → scatter → expression.json |
| test_count_lines4_v1 | Multi-input + collection output |
| test_count_lines4_json | JSON job input |
| test_scatter_wf1_v1 | Explicit CWL scatter |
Test Gaps
valueFrom expressions in workflow steps
Secondary files between workflow steps
when expressions
Subworkflow execution
Unresolved Questions
CWL parameters as DataParameterModel? CwlFileParameterModel extends BaseGalaxyToolParameterModelDefinition, not DataParameterModel. The runtimeify visitor won't recognize CWL file params without a fix (Step 1d). Best approach — extend the visitor, or make CWL types inherit from DataParameterModel?
Directory inputs? CWL directories stored as tar archives (ext directory). Legacy code untars into _inputs dir. Where does untar happen in new path? Needs class: "Directory" + listing, not class: "File".
Collection inputs (non-scatter)? {src: "hdca", id: N} through runtimeify — adapt_collection has NotImplementedError cases. Do CWL tools receive non-scatter collection inputs?
Format stripping still needed? ToolProxy.__init__ strips format from the CWL schema (parser.py:143-148). If we now provide EDAM format URIs, does cwltool validate them against the stripped schema? Mismatch risk.
compute_environment for secondary file paths? Does input_path_rewrite() work on extra_files_path subdirectories or only on HDA file names?
Expression tools in runtimeify? ExpressionTools have no command line (return ["true"]). Runtimeify still runs but the result flows to JobProxy.save_job() for output collection. Verify this path.
Symlink staging for secondary files? Legacy code symlinks primary + secondary into _inputs dir for basename-relative refs. With runtimeify, paths point to extra_files_path/__secondary_files__/. Does cwltool's PathMapper.stage_files() handle this, or do we need explicit symlinking?
input_dataset_collections wiring? set_compute_environment needs input_dataset_collections. job.io_dicts() returns (inp_data, out_data, out_collections) but not input collections directly. How does job.input_dataset_collections map to the format setup_for_runtimeify expects?
CwlFileParameterModel.pydantic_template() acceptance? Does it accept {src: "hda", id: N} for job_internal state representation? py_type = DataRequest. Verify DataRequest matches the dict format.
How does _execute() create JobToInputDatasetAssociation? Does it walk params for HDA objects, or use validated_param_combination? If the former, CWL input dict with {src, id} refs won't work without a fix.
MappingParameters.validated_param_template — does execute() require it to be non-None, or does only validated_param_combinations matter?
Research document covering the CWL integration branch (cwl-1.0 rebased onto Galaxy dev) and the recent WIP commits migrating to the tool request API. Written from code analysis of branch cwl_on_tool_request_api_2.
The branch has ~52 commits from the legacy cwl-1.0 branch (rebased onto Galaxy dev post release_26.0) plus 4 new WIP commits that begin migrating CWL tool execution to the modern tool request API.
Legacy commits (~48): Implement CWL tool/workflow parsing, parameter translation, execution via cwltool, conformance test infrastructure, output collection, and many bug fixes.
New commits (4):
d2f9a20b36 WIP: by-pass legacy Galaxy parameter handling for CWL tools
d968749217 Type error...
d4d68d2a9b Fix persisting CWL tools for tool requests
c290f52d83 WIP: migrate CWL tool running to tool request API
Architecture Overview
The CWL integration wraps the reference CWL runner (cwltool) inside Galaxy's tool framework. The core design pattern is a proxy layer that adapts CWL concepts to Galaxy concepts:
CWL Tool Description (.cwl file)
↓ cwltool parses
ToolProxy (wraps cwltool.process.Process)
↓ adapted to
Galaxy Tool (CwlTool/GalacticCwlTool extends Tool)
↓ parameters adapted via
Galaxy Parameter System (basic.py FieldTypeToolParameter, conditionals, repeats)
↓ reverse-converted at execution via
to_cwl_job() / galactic_flavored_to_cwl_job()
↓ fed to
JobProxy (wraps cwltool.job.Job)
↓ extracts
Shell command + environment + file staging
↓ executed by
Galaxy job runner (standard execution pipeline)
↓ outputs collected by
handle_outputs() → relocate_dynamic_outputs.py
The fundamental problem (from PROBLEM_AND_GOAL.md): CWL has flexible, schema-based parameters. Galaxy has opinionated, inflexible tool parameters. The legacy branch adapted CWL → Galaxy parameters → back to CWL, which required extensive hacking. The new commits bypass this round-trip.
Key hack: _hack_cwl_requirements() moves DockerRequirement from requirements to hints so Galaxy's own container system handles containerization instead of cwltool.
InputInstance (Simplified by New Commits)
Before new commits: InputInstance had input_type, collection_type, array, area attributes and a complex to_dict() producing Galaxy form widgets with conditionals and selects.
After new commits: InputInstance is stripped to just name, label, description. The function _outer_field_to_input_instance() no longer maps CWL types to Galaxy widget types.
The CWL type system is mapped to Galaxy's parameter types:
| CWL Type | Galaxy Representation | Galaxy Widget |
| --- | --- | --- |
| File | DATA | DataToolParameter |
| Directory | DATA | DataToolParameter |
| string | TEXT | TextToolParameter |
| int/long | INTEGER | IntegerToolParameter |
| float/double | FLOAT | FloatToolParameter |
| boolean | BOOLEAN | BooleanToolParameter |
| array | DATA_COLLECTION (list) | DataCollectionToolParameter |
| record | DATA_COLLECTION (record) | DataCollectionToolParameter |
| enum | TEXT or SELECT | SelectToolParameter |
| Any/union | FIELD or CONDITIONAL | FieldTypeToolParameter or Conditional |
| null | (no param) | — |
Union types are the worst offender — a CWL input like [null, File, int] gets mapped to a Galaxy conditional with a select dropdown (_cwl__type_) to pick the active type, and a nested value input (_cwl__value_).
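For illustration, the resulting Galaxy tool state for a hypothetical union-typed input named threshold would be encoded roughly like this (the parameter name and exact serialization are assumptions):

```json
{
  "threshold": {
    "_cwl__type_": "int",
    "_cwl__value_": 5
  }
}
```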
FieldTypeToolParameter (basic.py:2907)
The field type is the catch-all CWL parameter in Galaxy's parameter system:
```python
class FieldTypeToolParameter(ToolParameter):
    def from_json(self, value, trans, other_values=None):
        # Handles: None, dicts with "src", File class dicts, raw values
        if value.get("class") == "File":
            return raw_to_galaxy(trans.app, trans.history, value)
        return self.to_python(value, trans.app)

    def to_json(self, value, app, use_security):
        # Serializes: None, dicts with src/id, File class dicts, raw values
```
This parameter type handles the kitchen-sink nature of CWL inputs but is inherently hacky — it's trying to encode arbitrary structured data within Galaxy's parameter framework.
For conditional inputs → reads _cwl__type_ discriminator, extracts _cwl__value_
For data inputs → calls dataset_wrapper_to_file_json() which creates CWL File objects with location, size, checksum, secondary files
For data_collection → collection_wrapper_to_array() or collection_wrapper_to_record()
For primitives → direct type conversion
galactic_flavored_to_cwl_job(tool, param_dict, local_working_directory) — Simpler variant for tools with gx:Interface extensions.
to_galaxy_parameters(tool, as_dict) — Reverse: CWL job outputs → Galaxy tool state. Used when workflow steps receive CWL results.
The Problem
The round-trip (CWL schema → Galaxy widgets → user input → Galaxy param_dict → CWL job JSON) loses type fidelity, requires extensive special-casing, and touches basic.py which is core Galaxy infrastructure. Every CWL type needs Galaxy UI representation, serialization/deserialization, and reverse-mapping logic.
POST /api/jobs (tool_request_raw)
↓
Tool.handle_input_async() with has_galaxy_inputs=False:
- SKIPS expand_meta_parameters_async()
- SKIPS _populate_async()
- Passes raw input state through as-is
↓
Celery task (serializes tool via persistent representation)
↓
CwlCommandBindingTool.exec_before_job(validated_tool_state):
input_json = validated_tool_state.input_state ← direct, no reverse-engineering
... rest same as legacy
How CWL tools execute inside Galaxy's job infrastructure — from API request to command execution to output collection. Written to inform the migration toward a runtimeify-style approach using the tool request API and validated tool state.
CWL tool execution uses a fundamentally different architecture from YAML tools:
YAML tools use UserToolEvaluator with param_dict_style="json". The evaluator calls runtimeify() to convert JobInternalToolState into a JobRuntimeToolState with CWL-style File objects. Command building uses do_eval() (JavaScript/CWL expressions) against these inputs. Everything happens inside the evaluator — no special pre-processing hook needed.
CWL tools use the standard ToolEvaluator with param_dict_style="regular". The critical work happens in CwlCommandBindingTool.exec_before_job(), which extracts validated_tool_state.input_state, creates a JobProxy wrapping cwltool, and pre-computes the entire command line, stdin/stdout/stderr, and environment variables. These are stashed in param_dict as __cwl_command and __cwl_command_state. The evaluator just uses those pre-computed values verbatim.
The key architectural difference: YAML tools build commands at evaluation time using expressions. CWL tools delegate command building to cwltool via a proxy object and store the result before evaluation even starts.
YAML Tool Runtime (The Target Pattern)
Understanding this is critical because the goal is to make CWL execution follow a similar pattern.
has_galaxy_inputs Flag (tools/__init__.py:1725,1734)
Set during parse_inputs():
```python
self.has_galaxy_inputs = False  # line 1725
if pages.inputs_defined:
    self.has_galaxy_inputs = True  # line 1734
```
For CWL tools, CwlPageSource.inputs_style is "cwl", which means inputs_defined behavior depends on the page source implementation. With the new commits, CWL tools have has_galaxy_inputs = False.
```python
# Line 254-256:
if execution_slice.validated_param_combination:
    tool_state = execution_slice.validated_param_combination.input_state
    job.tool_state = tool_state
```
This persists the JobInternalToolState.input_state dict as JSON on the Job model. For CWL tools using the new path, this is the raw CWL-compatible input dict (dataset references as {src: "hda", id: <int>}).
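For a hypothetical two-input tool, the persisted state might look like this (illustrative only):

```json
{
  "input_file": {"src": "hda", "id": 42},
  "min_length": 10
}
```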
Celery Serialization
Tool request API uses Celery tasks. CWL tools must round-trip through serialization:
ToolProxy.to_persistent_representation() serializes the full CWL tool description
QueueJobs schema carries tool_id and tool_uuid
On the worker, create_tool_from_representation() reconstructs the tool
This was fixed in commit d4d68d2a9b
Phase 3: Job Preparation and Evaluation
JobWrapper.prepare() (jobs/__init__.py:1247-1314)
Called by the job runner when the job is ready to execute:
```python
tool_evaluator = self._get_tool_evaluator(job)                     # line 1270
tool_evaluator.set_compute_environment(compute_environment, ...)   # line 1272
(self.command_line, self.version_command_line,
 self.extra_filenames, self.environment_variables,
 self.interactivetools) = tool_evaluator.build()                   # line 1274
```
Evaluator Selection (jobs/__init__.py:1402-1415)
```python
if self.tool.base_command or self.tool.shell_command:
    klass = UserToolEvaluator  # YAML tools have these
else:
    klass = ToolEvaluator      # CWL tools don't
```
CWL tools always get ToolEvaluator (not UserToolEvaluator).
Critical observation: The input_json at step 1 is now validated_tool_state.input_state (the new path). In the legacy path, this would have been self.param_dict_to_cwl_inputs(param_dict, local_working_directory) which reverse-engineers CWL inputs from Galaxy's wrapped parameter dict via to_cwl_job() or galactic_flavored_to_cwl_job().
$GALAXY_SLOTS Handling
needs_shell_quoting_hack() exempts $GALAXY_SLOTS from quoting. But there's a deeper hack: cwltool needs a concrete number for ResourceRequirement.coresMin at job-construction time. JobProxy._select_resources() substitutes a sentinel value (1.480231396), and command_line property replaces it back with $GALAXY_SLOTS:
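A sketch of that substitution (only the sentinel constant is taken from the code above; the helper itself is illustrative, not the actual command_line property):

```python
GALAXY_SLOTS_SENTINEL = "1.480231396"  # stand-in coresMin handed to cwltool at job construction


def unsentinel(args):
    """Swap the sentinel back for $GALAXY_SLOTS in the assembled command line (sketch)."""
    return [arg.replace(GALAXY_SLOTS_SENTINEL, "$GALAXY_SLOTS") for arg in args]
```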
Key: use_container=False — Galaxy's own containerization (Docker/Singularity) wraps the command later in build_command(). cwltool must not try to run containers.
For CommandLineTools: delegates to cwltool's collect_outputs() which evaluates output glob patterns.
For ExpressionTools: calls cwl_job.run() to execute the JavaScript expression.
This galaxy.json contains per-output metadata: created_from_basename, ext, format, and for collections, elements.
Secondary Files
Stored in dataset_{id}_files/__secondary_files__/ with an index file:
{"order": ["file.idx", "file.bai"]}
The move_output() function handles secondary file naming. CWL uses a ^ prefix convention (each ^ removes one extension from the primary file name), but the code also supports STORE_SECONDARY_FILES_WITH_BASENAME mode.
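A minimal sketch of the ^ convention (not the actual move_output() code):

```python
import os


def secondary_file_name(primary_basename, pattern):
    """Each leading '^' strips one extension from the primary name before appending the suffix."""
    base = primary_basename
    while pattern.startswith("^"):
        base = os.path.splitext(base)[0]
        pattern = pattern[1:]
    return base + pattern


assert secondary_file_name("reads.bam", ".bai") == "reads.bam.bai"
assert secondary_file_name("reads.bam", "^.bai") == "reads.bai"
```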
The Representation Layer (Legacy Hack)
This section documents what the migration aims to eliminate. The legacy path converts Galaxy param_dict back to CWL inputs.
to_cwl_job() (representation.py:386-488)
Called by CwlTool.param_dict_to_cwl_inputs(). Walks tool.inputs (Galaxy's parsed parameter tree):
Handles secondary files by symlinking into an _inputs directory.
Why This Is a Problem
The round-trip (CWL schema → Galaxy widgets → user input → Galaxy param_dict → CWL job JSON) requires:
Every CWL type mapped to a Galaxy widget type (TYPE_REPRESENTATIONS)
Union types become Galaxy conditionals with _cwl__type_/_cwl__value_ keys
FieldTypeToolParameter in basic.py as a catch-all for CWL's flexible typing
DatasetWrappers must be reverse-engineered back to CWL File objects
All of this touches basic.py, which is core Galaxy infrastructure
With validated tool state, the input JSON goes directly to exec_before_job without this reverse-engineering.
Comparison: CWL vs YAML Tool Runtime
| Aspect | CWL Tool (current) | YAML Tool (runtimeify) |
| --- | --- | --- |
| Evaluator class | ToolEvaluator | UserToolEvaluator |
| param_dict_style | "regular" | "json" |
| Input state source | validated_tool_state.input_state (new) or param_dict_to_cwl_inputs() (legacy) | runtimeify(validated_tool_state) |
| Dataset → File conversion | Done in exec_before_job (input_json already has references) OR dataset_wrapper_to_file_json() (legacy) | Done by setup_for_runtimeify() adapters |
| Command building | Pre-computed by cwltool via JobProxy, stored in __cwl_command | Built at eval time via do_eval() with CWL expressions |
| Where command lives | param_dict["__cwl_command"] | Returned from _build_command_line() |
| Output collection | Post-execution relocate_dynamic_outputs.py script via cwltool | Standard Galaxy metadata |
| Job proxy needed | Yes — wraps cwltool.job.Job | No |
| Container handling | Galaxy wraps (cwltool use_container=False) | Galaxy wraps |
| File staging | cwltool PathMapper (symlinks) | Galaxy's standard input staging |
| Config files | None | YamlTemplateConfigFile |
| Environment vars | From cwltool (EnvVarRequirement) | From tool definition |
Key Structural Differences
Command pre-computation: CWL delegates to cwltool at exec_before_job time. YAML tools evaluate at _build_command_line time. This is unavoidable — cwltool is the authoritative CWL command builder.
Two-phase output: CWL uses a post-execution script to collect outputs because cwltool needs to run its own output glob evaluation. YAML tools use Galaxy's standard metadata.
File staging: CWL uses cwltool's PathMapper. YAML tools use Galaxy's standard input path rewriting via compute_environment.input_path_rewrite().
No runtimeify() equivalent: CWL currently gets validated_tool_state.input_state directly. It does NOT go through runtimeify() to convert dataset references to File objects with paths. The input_state either already has the right format (new path) or gets reverse-engineered from param_dict (legacy path).
What a CWL Runtimeify Would Look Like
The goal: make CWL execution use typed state transitions similar to YAML tools.
Current New Path (partially done)
validated_tool_state.input_state (has dataset refs as {src: "hda", id: N})
↓
exec_before_job() filters + passes directly to JobProxy
↓
JobProxy._normalize_job() fills CWL defaults
↓
cwltool processes and generates command
What's Missing for Full Runtimeify
Dataset reference → File object conversion: Currently exec_before_job receives input_state with raw references. Someone needs to convert {src: "hda", id: N} to {"class": "File", "location": "/path/to/file", ...}. In the YAML path, runtimeify() + setup_for_runtimeify() does this. For CWL, this conversion could happen:
Option A: Inside exec_before_job (current approach — it has access to inp_data)
Option B: Via a CWL-specific runtimeify() before exec_before_job
Option C: Use UserToolEvaluator for CWL tools too (would need base_command or shell_command set)
Secondary files: YAML's runtimeify() doesn't handle secondary files. CWL needs them. dataset_wrapper_to_file_json() currently handles this in the legacy path.
Directory inputs: CWL directories are tar archives in Galaxy. Need extraction logic that the YAML path doesn't have.
Collection mapping: CWL arrays/records map to Galaxy collections. The YAML runtimeify() has adapt_collection but raises NotImplementedError for some cases.
The Input State Question
In the current new path, what does validated_tool_state.input_state look like for CWL tools? It appears to be the raw API input — dataset references but not yet File objects with paths. The conversion to CWL File objects (with location, size, checksum, secondary files) would need to happen somewhere before JobProxy gets the input dict.
The YAML tool path does this in setup_for_runtimeify() → adapt_dataset() which creates DataInternalJson objects (CWL File-like). A CWL equivalent would need to be richer — adding secondary files, CWL format URIs, checksums, etc.
Unresolved Questions
Where should dataset→File conversion happen for CWL? In exec_before_job (has inp_data dict), in a CWL-specific runtimeify, or somewhere else? The current code in exec_before_job just uses validated_tool_state.input_state directly — does this already contain resolved paths or just references?
Can CWL tools use UserToolEvaluator? They'd need base_command or shell_command. Could we set a synthetic shell_command that's the cwltool-generated command? Probably not — the command isn't known until JobProxy runs.
How close can file staging get to Galaxy's standard path? CWL uses cwltool's PathMapper for symlinks. YAML uses compute_environment.input_path_rewrite(). Could we skip PathMapper and use Galaxy's rewriting? Probably not for InitialWorkDirRequirement files.
Can output collection move inside Galaxy? Currently it's a post-execution script. Could collect_outputs() run inside Galaxy's job finishing instead of as a script appended to the job? This would avoid needing to serialize the full tool representation to .cwl_job.json.
ExpressionTool execution: These run JS, not shell commands. The current path returns ["true"] as the command and runs the expression during collect_outputs. How does this interact with the tool request API? Is there a simpler path?
What validated_tool_state.input_state actually contains for CWL right now? Need to trace a concrete test case to see the actual JSON structure at each phase. The filtering in exec_before_job (removing location == "None" files and empty strings) suggests the state may not be fully clean yet.
Secondary files in the new path: The legacy dataset_wrapper_to_file_json() reconstructs secondary files from __secondary_files__ directories. In the new path using validated_tool_state.input_state, who provides secondary file information?
CWL Tool Loading and Reference Test Infrastructure
Research document covering how Galaxy loads CWL tools from .cwl files into executable Tool objects, how the CWL reference/conformance test infrastructure works, and how loaded CWL tools interact with the tool request API.
ExpressionToolProxy (line 325) - subclass of CommandLineToolProxy, only changes _class = "ExpressionTool".
Step 5: Input Parameters - From CWL Schema to Galaxy Parameter Models
CWL inputs flow through two parallel systems:
A. Galaxy Legacy Parameters (parse_inputs in __init__.py)
Tool.parse_inputs() at lib/galaxy/tools/__init__.py:1718-1757:
```python
def parse_inputs(self, tool_source):
    self.has_galaxy_inputs = False
    pages = tool_source.parse_input_pages()
    # CwlToolSource returns PagesSource with inputs_style="cwl"
    # PagesSource.inputs_defined returns True (style != "none")
    try:
        parameters = input_models_for_pages(pages, self.profile)
        self.parameters = parameters
    except Exception:
        pass
    if pages.inputs_defined:
        self.has_galaxy_inputs = True  # <-- WAS True for CWL
        # BUT the new branch bypasses this for CWL
```
Key change on this branch: has_galaxy_inputs is set True because inputs_style="cwl" is not "none". However, the expand_incoming_async() method at line 2183-2191 checks self.has_galaxy_inputs to decide whether to run Galaxy's parameter expansion machinery. When has_galaxy_inputs=False (forced for CWL in the new path), the raw state passes through.
B. CWL Parameter Models (New typed system)
CwlPageSource (parser/cwl.py:366) creates CwlInputSource objects from the tool proxy's input_instances().
These flow into input_models_for_pages() at lib/galaxy/tool_util/parameters/factory.py:453:
_from_input_source_cwl() (factory.py:421-436) maps CWL schema-salad types to parameter models:
| CWL Type | Galaxy Parameter Model | parameter_type |
| --- | --- | --- |
| int | CwlIntegerParameterModel | "cwl_integer" |
| float | CwlFloatParameterModel | "cwl_float" |
| string | CwlStringParameterModel | "cwl_string" |
| boolean | CwlBooleanParameterModel | "cwl_boolean" |
| null | CwlNullParameterModel | "cwl_null" |
| org.w3id.cwl.cwl.File | CwlFileParameterModel | "cwl_file" |
| org.w3id.cwl.cwl.Directory | CwlDirectoryParameterModel | "cwl_directory" |
| [type1, type2, ...] (union) | CwlUnionParameterModel | "cwl_union" |
These models live in lib/galaxy/tool_util_models/parameters.py:1943-2100.
CwlFileParameterModel and CwlDirectoryParameterModel (lines 2061-2088) both use DataRequest as their py_type, meaning the API expects {src: "hda", id: <encoded_id>} for dataset inputs.
```python
# Line 460-466
elif tool_type := tool_source.parse_tool_type():
    ToolClass = tool_types.get(tool_type)
    if ToolClass is None:
        if tool_type == "cwl":
            raise ToolLoadError("Runtime support for CWL tools is not implemented currently")

# Line 5085-5103 - TOOL_CLASSES list includes:
#   CwlTool,          # tool_type = "cwl"
#   GalacticCwlTool,  # tool_type = "galactic_cwl"
tool_types = {tool_class.tool_type: tool_class for tool_class in TOOL_CLASSES}
```
Note: The error at line 463-464 fires only if CwlTool is not in TOOL_CLASSES (i.e., on mainline Galaxy without CWL support). On this CWL branch, CwlTool IS in the list.
Conformance Test Provisioning: update_cwl_conformance_tests.sh
The CWL conformance test tools are not vendored or submoduled. They are downloaded on-demand by scripts/update_cwl_conformance_tests.sh and not committed to git. This is a two-stage process:
Stage 1: Shell Script Downloads Tools
File: scripts/update_cwl_conformance_tests.sh
For each CWL version (1.0, 1.1, 1.2):
Downloads the official CWL spec repo as a zip from GitHub:
Extracts into test/functional/tools/cwl_tools/v{version}/:
conformance_tests.yaml — the test manifest (different source paths per version: v1.0 uses v1.0/conformance_test_v1.0.yaml, others use root conformance_tests.yaml)
The test tools directory — v1.0 copies v1.0/v1.0/ (creating the cwl_tools/v1.0/v1.0/ path that sample_tool_conf.xml references), others copy tests/
Runs scripts/cwl_conformance_to_test_cases.py to generate Python test files
Result directory structure after running:
test/functional/tools/cwl_tools/
├── v1.0/
│ ├── conformance_tests.yaml
│ └── v1.0/ # actual test tools (cat1-testcli.cwl, bwa-mem-tool.cwl, etc.)
├── v1.0_custom/ # committed Galaxy-specific CWL test tools
├── v1.1/
│ ├── conformance_tests.yaml
│ └── tests/ # CWL v1.1 test tools
└── v1.2/
├── conformance_tests.yaml
└── tests/ # CWL v1.2 test tools
Stage 2: Python Script Generates Test Cases
File: scripts/cwl_conformance_to_test_cases.py
Reads conformance_tests.yaml recursively (following $import references via its own conformance_tests_gen())
For each conformance test entry, generates a pytest method in a TestCwlConformance class:
```python
@pytest.mark.cwl_conformance
@pytest.mark.cwl_conformance_v1_0
@pytest.mark.command_line_tool  # from CWL test tags
@pytest.mark.green              # or @pytest.mark.red
def test_conformance_v1_0_cat1(self):
    """Test doc string..."""
    self.cwl_populator.run_conformance_test("v1.0", "Test doc string...")
```
Tests are marked red (known-failing in Galaxy) or green based on a hardcoded RED_TESTS dict:
v1.0: ~30 red tests (mostly scatter/valuefrom/subworkflow/secondary files)
v1.1: ~50 red tests (adds timelimit, networkaccess, inplace_update, etc.)
v1.2: ~100+ red tests (adds conditionals, v1.2-specific features)
Writes generated test file to lib/galaxy_test/api/cwl/test_cwl_conformance_v{version_simple}.py
The generated test class extends BaseCwlWorkflowsApiTestCase and each method calls self.cwl_populator.run_conformance_test(version, doc) — which looks up the test by doc string in conformance_tests.yaml, stages inputs, runs the tool/workflow, and compares outputs
The generated test files ARE committed; the downloaded tool files are NOT.
Unit tests in test/unit/tool_util/test_cwl.py reference paths like v1.0/v1.0/cat1-testcli.cwl — these require update_cwl_conformance_tests.sh to have been run first.
Tool Configuration for Tests
File: test/functional/tools/sample_tool_conf.xml
All test tools are registered in this file. The CWL section (lines 268-287):
Note: Several entries reference cwl_tools/v1.0/v1.0/*.cwl which do not exist on this branch. These tools would fail to load. Only parameters/cwl_int.cwl, the v1.0_custom/ tools, and the root-level galactic tools exist.
run_cwl_job(artifact, job_path, ...) (line 3084): Main entry point. Determines if artifact is tool or workflow, stages inputs via stage_inputs(), then dispatches to _run_cwl_tool_job() or _run_cwl_workflow_job().
_run_cwl_tool_job(tool_id, job, history_id) (line 3030): Posts to tool request API via tool_request_raw(). If tool doesn't exist in Galaxy, creates it as a dynamic tool via create_tool_from_path().
run_conformance_test(version, doc) (line 3150): Loads conformance test spec, runs the CWL job, and compares outputs using cwltest.compare.compare().
get_conformance_test(version, doc) (line 3024): Looks up a test by its doc field from conformance_tests.yaml in the test directory.
This expects conformance_tests.yaml in each CWL version directory (e.g., test/functional/tools/cwl_tools/v1.0/conformance_tests.yaml). Each test entry has tool, job, output, and doc fields.
Inside handle_input_async, expand_incoming_async() is called:
```python
# __init__.py:2183-2191
if self.has_galaxy_inputs:
    expanded_incomings, job_tool_states, collection_info = expand_meta_parameters_async(...)
else:
    # CWL tools: pass state through as-is
    expanded_incomings = [deepcopy(tool_request_internal_state.input_state)]
    job_tool_states = [deepcopy(tool_request_internal_state.input_state)]
    collection_info = None
```
Since CWL tools bypass Galaxy's parameter expansion, the input state passes through unchanged. A JobInternalToolState is created and validated against the tool's CWL parameter models:
The JobInternalToolState.input_state dict is persisted as JSON on the Job model. For CWL tools, this contains the raw CWL-compatible inputs with dataset references as {src: "hda", id: <int>}.
Celery Serialization of CWL Tools
The tool request API dispatches jobs via Celery. The tool itself must be serializable:
File: lib/galaxy/tools/execute.py:326-345
```python
raw_tool_source, tool_source_class = tool.to_raw_tool_source()
# For CWL: tool_source_class = "CwlToolSource"
# raw_tool_source = JSON string of ToolProxy.to_persistent_representation()
```
exec_before_job (__init__.py:3757-3829): Takes validated_tool_state.input_state, creates JobProxy, pre-computes command via cwltool, stores in param_dict["__cwl_command"].
The Input State Gap
Currently there is a structural gap in the new path: validated_tool_state.input_state at exec_before_job time still contains dataset references ({src: "hda", id: N}) rather than CWL File objects with paths. The JobProxy._normalize_job() expects File objects with path or location keys.
This conversion (dataset reference -> CWL File object with filesystem path) is the key missing piece. In the YAML tool path, runtimeify() + setup_for_runtimeify() handles this. For CWL, it needs to happen somewhere between state reconstruction and JobProxy creation, enriched with CWL-specific data (secondaryFiles, format URIs, etc.).
Dynamic Tool Loading (Test Infrastructure)
When a CWL tool is not pre-loaded in the toolbox, tests create it dynamically:
create_tool_from_path() (line 1057) posts to Galaxy's dynamic tool creation API with src="from_path". This uses lib/galaxy/managers/tools.py which requires enable_beta_tool_formats config.
Only parameters/cwl_int.cwl is in sample_tool_conf.xml from the parameters directory — should other CWL parameter tools (cwl_float.cwl, cwl_string.cwl, cwl_file.cwl, etc.) be added?
The has_galaxy_inputs flag for CWL is True because inputs_style="cwl" satisfies inputs_defined. How is this being overridden to False in the new path? Is there a separate mechanism?
How are CWL array and record input types handled by the new parameter model system? _from_input_source_cwl() only handles simple types and unions — no array/record support yet.
CwlUnionParameterModel has request_requires_value = False (with TODO comment) — is this correct for all unions, or only unions containing null?
How to plumb CWL tool execution from persisted JobInternalToolState through to cwltool command extraction, using the YAML tool runtimeify infrastructure.
Branch: cwl_on_tool_request_api_2
Entry Assumptions
Request entered via Tool Request API (POST /api/jobs)
Job object has job.tool_state containing persisted JobInternalToolState (dataset refs as {src: "hda", id: N})
CWL ToolParameter objects available matching the state schema
job.tool_state is the JSON dict persisted at job creation (execute.py:254-256). JobInternalToolState wraps it and validates against the tool's CWL parameter model. The validated state flows into execute_tool_hooks() at line 222.
This is the core new work. Currently exec_before_job receives validated_tool_state.input_state with raw {src: "hda", id: N} references and passes them directly to JobProxy. But JobProxy._normalize_job() (parser.py:376-391) expects File objects with path/location keys. That's a structural mismatch.
2a: Call runtimeify before exec_before_job
Where: ToolEvaluator.set_compute_environment(), in the param_dict_style == "regular" branch at evaluation.py:200-222.
Change: After reconstructing internal_tool_state and before calling execute_tool_hooks, insert a runtimeify call:
```python
internal_tool_state = None
if job.tool_state:
    internal_tool_state = JobInternalToolState(job.tool_state)
    internal_tool_state.validate(self.tool, f"{self.tool.id} (job internal model)")

# NEW: runtimeify for CWL tools
if internal_tool_state is not None and self.tool.tool_type in CWL_TOOL_TYPES:
    from galaxy.tool_util.parameters.convert import runtimeify
    from galaxy.tools.runtime import setup_for_runtimeify

    hda_references, adapt_dataset, adapt_collection = setup_for_runtimeify(
        self.app, compute_environment, inp_data, input_dataset_collections
    )
    job_runtime_state = runtimeify(
        internal_tool_state, self.tool, adapt_dataset, adapt_collection
    )
    # Replace internal_tool_state with runtime state for exec_before_job
    internal_tool_state = job_runtime_state
```
Why here and not inside exec_before_job: setup_for_runtimeify needs inp_data (the {name: HDA} dict) and compute_environment, both available at this scope. This mirrors where UserToolEvaluator calls runtimeify in its build_param_dict() (evaluation.py:1130-1170). Keeping runtimeify in the evaluator means exec_before_job receives already-resolved File objects -- cleaner separation.
What runtimeify returns: A JobRuntimeToolState whose input_state dict has DataInternalJson File objects instead of {src: "hda", id: N} references. Each File object contains class, path, basename, nameroot, nameext, format, size, location, listing.
The state type passed to exec_before_job changes: from JobInternalToolState to JobRuntimeToolState. The execute_tool_hooks signature accepts Optional[JobInternalToolState] -- we either widen it to accept both types, or pass the runtime state's input_state dict through a wrapper. The simplest approach: change the parameter type to Optional[Union[JobInternalToolState, JobRuntimeToolState]] and update exec_before_job to accept either. Both expose .input_state with the same interface.
2b: CWL-specific adapt_dataset -- secondary files, format URIs
The existing adapt_dataset in runtime.py:77-94 produces a basic DataInternalJson with class, path, basename, format (Galaxy extension), size, location, listing. This is sufficient for YAML tools. CWL needs more:
Secondary files -- stored at {hda.extra_files_path}/__secondary_files__/
CWL format URIs -- http://edamontology.org/{edam_format} (available via hda.cwl_formats, model __init__.py:5212-5213)
Checksum -- optional, cwltool doesn't require it for command generation
Option A (preferred): CWL-specific adapt_dataset callback.
Create a new function in runtime.py (or a new cwl_runtime.py module) that wraps the base adapt_dataset:
```python
# lib/galaxy/tools/cwl_runtime.py (new file)
def setup_for_cwl_runtimeify(app, compute_environment, input_datasets, input_dataset_collections=None):
    """CWL-enriched version of setup_for_runtimeify.

    Returns (hda_references, adapt_dataset, adapt_collection) where adapt_dataset
    produces CwlDataInternalJson with secondaryFiles.
    """
    hda_references, base_adapt_dataset, adapt_collection = setup_for_runtimeify(
        app, compute_environment, input_datasets, input_dataset_collections
    )
    hdas_by_id = {d.id: d for d in input_datasets.values() if d is not None}

    def adapt_dataset(value):
        base_result = base_adapt_dataset(value)
        hda = hdas_by_id.get(value.id)
        if hda is None:
            return base_result
        result_dict = base_result.model_dump(by_alias=True)
        # Enrich with secondary files
        secondary_files = discover_secondary_files(hda, compute_environment)
        if secondary_files:
            result_dict["secondaryFiles"] = secondary_files
        # Enrich with CWL format URI (replace Galaxy extension with EDAM URI)
        if hasattr(hda, "cwl_formats") and hda.cwl_formats:
            result_dict["format"] = str(hda.cwl_formats[0])
        return CwlDataInternalJson(**result_dict)

    return hda_references, adapt_dataset, adapt_collection
```
Option B (alternative): Enrich after runtimeify.
Call base runtimeify(), then walk the result and enrich File objects with secondary files. Downside: requires a second traversal. Option A is cleaner.
2c: discover_secondary_files function
New function in cwl_runtime.py:
```python
def discover_secondary_files(hda, compute_environment=None):
    """Discover secondary files for an HDA from its extra_files_path.

    Secondary files are stored at {extra_files_path}/__secondary_files__/ with an
    ordering index at __secondary_files_index.json.

    Returns list of dicts: [{"class": "File"|"Directory", "path": "...", "basename": "..."}]
    """
    extra_files_path = hda.extra_files_path
    secondary_files_dir = os.path.join(extra_files_path, SECONDARY_FILES_EXTRA_PREFIX)
    if not os.path.exists(secondary_files_dir):
        return []
    secondary_files = []
    for name in os.listdir(secondary_files_dir):
        sf_path = os.path.join(secondary_files_dir, name)
        real_path = os.path.realpath(sf_path)
        is_dir = os.path.isdir(real_path)
        entry = {
            "class": "Directory" if is_dir else "File",
            "path": compute_environment.input_path_rewrite(sf_path) if compute_environment else sf_path,
            "basename": name,
        }
        secondary_files.append(entry)
    return secondary_files
```
This mirrors the logic in representation.py:163-183 (dataset_wrapper_to_file_json) but works from HDA objects rather than DatasetWrappers.
Important: The legacy code in representation.py:168-183 symlinks the primary file and secondary files into a shared _inputs directory so basename-based references work. We need the same symlinking -- but stage_files() in JobProxy handles this via cwltool's PathMapper. The paths in secondaryFiles should be the real paths; cwltool will stage them.
2d: CwlDataInternalJson model
Where: lib/galaxy/tool_util_models/parameters.py, extend from DataInternalJson
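A rough sketch of the two models (field sets beyond secondaryFiles are assumptions; DataInternalJson is the existing base model at the location cited above):

```python
from typing import List, Literal, Optional

from pydantic import BaseModel, Field

from galaxy.tool_util_models.parameters import DataInternalJson  # existing base model


class CwlSecondaryFileJson(BaseModel):
    class_: Literal["File", "Directory"] = Field(alias="class")
    path: str
    basename: str


class CwlDataInternalJson(DataInternalJson):
    secondaryFiles: Optional[List[CwlSecondaryFileJson]] = None
```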
Alternative: Add secondaryFiles directly to DataInternalJson as an optional field (it's already commented out at line 610). This avoids a subclass but pollutes the base model with CWL concerns. Prefer the subclass.
Validation concern: runtimeify() validates the output as JobRuntimeToolState (convert.py:576). The validation calls validate(input_models) which checks against parameter model schemas. CwlDataInternalJson must be accepted wherever DataInternalJson is. Since CwlDataInternalJson extends DataInternalJson, and secondaryFiles is optional, existing validators should accept it. The model_dump(by_alias=True) output will include secondaryFiles only when present.
2e: How runtimeify walks CWL parameters
runtimeify() (convert.py:539-577) uses visit_input_values() to walk the tool's parameter model. For CWL tools, the parameter model consists of CwlInputParameter objects (or similar). The visitor identifies DataParameterModel instances and calls adapt_dict() on their values.
Critical question: Do CWL tool parameters implement DataParameterModel? The CWL parameter model (CwlInputParameter) must either be or extend DataParameterModel for file-type inputs, or runtimeify won't recognize them. If CWL parameters use a different type hierarchy, we need either:
Extend the to_runtime_callback to recognize CWL file parameters, or
Ensure CWL file parameters are modeled as DataParameterModel
This needs verification by tracing the actual parameter model instances for a CWL tool.
Step 3: How Runtimeified State Flows into exec_before_job and JobProxy
3a: exec_before_job receives runtimeified state
Where: CwlCommandBindingTool.exec_before_job() at tools/__init__.py:3757-3829
Current code at line 3765:
input_json=validated_tool_state.input_state
After runtimeify, validated_tool_state is a JobRuntimeToolState. Its .input_state now contains CWL File objects instead of {src: "hda", id: N} refs. Example:
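(Hypothetical input name and values; the fields follow the DataInternalJson field list above, with format shown as an EDAM URI per the Step 2 enrichment.)

```json
{
  "input_file": {
    "class": "File",
    "path": "/galaxy/datasets/000/dataset_42.dat",
    "location": "/galaxy/datasets/000/dataset_42.dat",
    "basename": "dataset_42.dat",
    "nameroot": "dataset_42",
    "nameext": ".dat",
    "format": "http://edamontology.org/format_2572",
    "size": 10240,
    "listing": null
  }
}
```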
After runtimeify: The location == "None" check likely won't trigger because runtimeify only produces File objects for HDAs that actually exist (they come from inp_data which only has real datasets). Optional parameters with no value should appear as None in the state, not as File objects with location == "None".
The empty string filter handles optional string params with no value. This should still work -- runtimeify passes non-data parameters through unchanged (VISITOR_NO_REPLACEMENT).
Recommendation: Keep both filters for safety during transition, but add a TODO to remove once the old path is dead.
What the CWL runtimeify (via the enriched adapt_dataset) has already done by this point:
Resolve dataset paths via compute_environment.input_path_rewrite()
Populate basename, nameroot, nameext, format, size
Discover and attach secondary files
Resolve CWL format URIs
What _normalize_job() still does (parser.py:376-391):
process.fill_in_defaults() -- fills CWL default values for parameters not provided in the input dict. Runtimeify doesn't know about CWL defaults (they're in the cwltool Process object, not the Galaxy parameter model). Still needed.
visit_class(input_dict, ("File", "Directory"), pathToLoc) -- converts "path" keys to "location" keys. Runtimeify produces File objects with both path and location (the DataInternalJson model has both fields). cwltool expects location for its internal processing. Still needed, but becomes a no-op if location is already set. The pathToLoc callback only acts when location is absent:
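(Sketch of such a callback; cwltool's actual implementation may differ in detail.)

```python
def path_to_loc(obj):
    # Only fires when "location" is absent, so runtimeified inputs
    # (which carry both keys) pass through untouched.
    if "location" not in obj and "path" in obj:
        obj["location"] = obj.pop("path")
```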
Since runtimeify sets both path and location, this callback won't fire for runtimeified inputs. It's still needed for files injected by fill_in_defaults (CWL defaults might use path only).
No structural transformation -- _normalize_job doesn't restructure records, arrays, or unions. It assumes the input dict already matches the CWL schema structure. Runtimeify preserves the original structure (it only replaces leaf values).
What _normalize_job does NOT need to change:
The function is already correct for the new path. It receives File objects (from runtimeify) where it previously received File objects (from to_cwl_job/galactic_flavored_to_cwl_job). The only difference is the source of those File objects.
Secondary files in _normalize_job:
visit_class also visits secondary files recursively (they're nested dicts with class: "File"). The pathToLoc callback will convert their path to location if needed. This works correctly with the secondary files we attach in discover_secondary_files.
Step 5: Comparison with YAML Tool Runtimeify Infrastructure
What we reuse from YAML tools:
| Component | Location | Reuse |
| --- | --- | --- |
| runtimeify() | convert.py:539-577 | Direct reuse -- same function, same visitor |
| visit_input_values() | convert.py (via parameter visitor) | Direct reuse |
| setup_for_runtimeify() | runtime.py:50-123 | Base infrastructure reused; CWL wraps it |
| DataInternalJson | tool_util_models/parameters.py:594-612 | Base model; CWL extends it |
| JobRuntimeToolState | tool_util_models/parameters.py (state.py) | Direct reuse |
| set_basename_and_derived_properties() | cwl/util.py:44-47 | Already shared |
What CWL adds on top:
| Component | Location | Purpose |
| --- | --- | --- |
| CwlDataInternalJson | tool_util_models/parameters.py (new) | Extends DataInternalJson with secondaryFiles |
| CwlSecondaryFileJson | tool_util_models/parameters.py (new) | Model for secondary file entries |
| setup_for_cwl_runtimeify() | tools/cwl_runtime.py (new) | Wraps setup_for_runtimeify with CWL enrichment |
| discover_secondary_files() | tools/cwl_runtime.py (new) | Finds secondary files in HDA extra_files_path |
Key structural differences from YAML path:
YAML: UserToolEvaluator.build_param_dict() calls runtimeify, then uses result for do_eval() expression evaluation.
CWL: ToolEvaluator.set_compute_environment() calls runtimeify, passes result to exec_before_job(), which feeds it to JobProxy/cwltool.
YAML: Command built at _build_command_line() time via JavaScript expressions against runtimeified state.
CWL: Command pre-computed by cwltool at exec_before_job() time, stashed in param_dict["__cwl_command"].
YAML: adapt_dataset returns DataInternalJson (no secondary files).
CWL: adapt_dataset returns CwlDataInternalJson (with secondary files, CWL format URIs).
YAML: No post-runtimeify normalization needed.
CWL: _normalize_job() still runs fill_in_defaults and pathToLoc after runtimeify.
evaluation.py -- ToolEvaluator.set_compute_environment()
Lines 200-222: Insert runtimeify call in the param_dict_style == "regular" branch, between state reconstruction and execute_tool_hooks:
```python
# After line 220 (validate internal_tool_state):
if internal_tool_state is not None and self.tool.tool_type in CWL_TOOL_TYPES:
    from galaxy.tool_util.parameters.convert import runtimeify
    from galaxy.tools.cwl_runtime import setup_for_cwl_runtimeify

    input_dataset_collections = None  # TODO: wire up from job.io_dicts()
    hda_references, adapt_dataset, adapt_collection = setup_for_cwl_runtimeify(
        self.app, compute_environment, inp_data, input_dataset_collections
    )
    internal_tool_state = runtimeify(
        internal_tool_state, self.tool, adapt_dataset, adapt_collection
    )
```
Type signature change: execute_tool_hooks() at line 236 accepts Optional[JobInternalToolState]. Widen to Optional[Union[JobInternalToolState, JobRuntimeToolState]]. Same for exec_before_job in tools/__init__.py.
tools/__init__.py -- exec_before_job()
Line 3757: Update type hint from Optional[JobInternalToolState] to Optional[Union[JobInternalToolState, JobRuntimeToolState]].
Lines 3775-3783: Keep the filtering but note it should be mostly unnecessary after runtimeify. Optional CWL inputs with no dataset should appear as None values, not as fake File objects.
No other changes -- the rest of exec_before_job (output_dict construction, JobProxy creation, command extraction, staging, saving) works the same regardless of whether input_json came from runtimeify or from the old path.
tool_util_models/parameters.py
After line 612: Add CwlSecondaryFileJson and CwlDataInternalJson models (see Step 2d).
convert.py -- runtimeify()
No changes to the function itself. The visitor pattern already handles any DataParameterModel leaf. The CWL-specific enrichment is handled by the callback (adapt_dataset), not by runtimeify's logic.
One potential issue: If CWL tool parameters aren't modeled as DataParameterModel, the visitor won't recognize them. See Unresolved Questions.
parser.py -- JobProxy._normalize_job()
No changes needed. fill_in_defaults and pathToLoc work correctly on pre-runtimeified input dicts. The pathToLoc callback is a no-op for entries that already have location set.
parser.py -- ToolProxy.__init__() format stripping
Lines 143-148 strip format from inputs_record_schema fields to prevent cwltool from complaining about missing format in input data. With the new path, we're providing CWL format URIs in the File objects (via hda.cwl_formats). This format stripping may become unnecessary -- but keep it for now since not all inputs may have format URIs, and it's a safe no-op if format URIs are present.
CwlSecondaryFileJson -- model for secondary file entries
CwlDataInternalJson(DataInternalJson) -- extends with secondaryFiles: Optional[List[CwlSecondaryFileJson]]
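A rough sketch of what these two models could look like. Only the secondaryFiles extension is documented above; the field names of CwlSecondaryFileJson and the stand-in DataInternalJson shown here are assumptions, and the real base model in tool_util_models/parameters.py is richer.

from typing import List, Optional
from pydantic import BaseModel, Field


class DataInternalJson(BaseModel):
    # Stand-in with a few plausible fields; the real class already exists upstream.
    path: Optional[str] = None
    basename: Optional[str] = None
    size: Optional[int] = None


class CwlSecondaryFileJson(BaseModel):
    # CWL class of the secondary entry: "File" or "Directory".
    class_: str = Field(alias="class")
    path: str
    basename: Optional[str] = None


class CwlDataInternalJson(DataInternalJson):
    # The documented extension: an optional list of secondary file entries.
    secondaryFiles: Optional[List[CwlSecondaryFileJson]] = None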
Summary: Data Flow
job.tool_state (JSON dict with {src: "hda", id: N} refs)
|
| JobInternalToolState(job.tool_state)
| .validate(tool, ...)
v
JobInternalToolState
|
| runtimeify(state, tool, cwl_adapt_dataset, adapt_collection)
| cwl_adapt_dataset:
| - resolves HDA path via compute_environment.input_path_rewrite()
| - sets basename, nameroot, nameext, format (EDAM URI), size
| - calls discover_secondary_files() for __secondary_files__
| - returns CwlDataInternalJson
v
JobRuntimeToolState
|
| exec_before_job(validated_tool_state=runtime_state)
| input_json = runtime_state.input_state
v
input_json (dict with CWL File objects, secondary files attached)
|
| Filter: remove location=="None", empty strings
|
| JobProxy(tool_proxy, input_json, output_dict, job_dir)
v
JobProxy._normalize_job()
| fill_in_defaults() -- inject CWL defaults for missing params
| visit_class(pathToLoc) -- convert path->location (mostly no-op)
v
JobProxy._ensure_cwl_job_initialized()
| Process.job(input_dict, callback, runtime_context) -> cwltool Job
v
cwltool Job
| .command_line -> list of args (with $GALAXY_SLOTS unsentineling)
| .stdin, .stdout, .stderr
| .environment
| .pathmapper -> for stage_files()
v
param_dict["__cwl_command"] = assembled command string
param_dict["__cwl_command_state"] = {args, stdin, stdout, stderr, env}
|
| ToolEvaluator.build()
| _build_command_line() -> uses __cwl_command verbatim
| _build_environment_variables() -> reads from __cwl_command_state.env
v
Command string + environment -> job script execution
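A sketch of the cwl_adapt_dataset callback described in the flow above. The responsibilities (path rewrite, basename/nameroot/nameext, EDAM format URI, size, secondary files) come from the diagram; the HDA attribute access, helper signatures, and returning a plain dict instead of CwlDataInternalJson are simplifying assumptions.

import os


def cwl_adapt_dataset(hda, compute_environment):
    """Sketch: turn one HDA into the CWL-flavored runtime representation."""
    path = compute_environment.input_path_rewrite(hda)  # documented responsibility
    basename = os.path.basename(path)
    nameroot, nameext = os.path.splitext(basename)
    formats = getattr(hda, "cwl_formats", None) or [None]  # EDAM URIs, if tracked
    return {
        "class": "File",
        "path": path,
        "basename": basename,
        "nameroot": nameroot,
        "nameext": nameext,
        "format": formats[0],
        "size": hda.get_size(),
        # Reuses the discover_secondary_files() sketch shown earlier.
        "secondaryFiles": discover_secondary_files(hda.extra_files_path),
    }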
Testing Strategy
Unit test discover_secondary_files(): Create mock HDA with extra_files_path containing __secondary_files__/ directory. Verify correct File/Directory classification and path resolution (see the pytest sketch after this list).
Unit test CWL adapt_dataset: Given an HDA with secondary files and CWL format, verify CwlDataInternalJson has correct secondaryFiles list and EDAM format URI.
Unit test runtimeify with CWL parameters: Create a CWL tool parameter model, a JobInternalToolState with HDA refs, run runtimeify, verify output has File objects with correct structure.
Integration test: Run a CWL tool with secondary files (e.g., BAM + BAI) through the tool request API. Verify:
job.tool_state persisted correctly
runtimeify produces File objects with secondaryFiles
cwltool receives correct input dict
Command line generated correctly
Output collection works
Regression test: Existing CWL conformance tests should pass with the new path.
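A pytest sketch for the first unit test above, exercising the discover_secondary_files() sketch from earlier; the directory layout and import location are assumptions based on the planned tools/cwl_runtime.py module.

# Planned location per the table earlier in this document (assumption):
from galaxy.tools.cwl_runtime import discover_secondary_files


def test_discover_secondary_files(tmp_path):
    # Lay out an extra_files_path using the legacy __secondary_files__ convention.
    extra = tmp_path / "dataset_1_files"
    sec = extra / "__secondary_files__"
    sec.mkdir(parents=True)
    (sec / "reads.bam.bai").write_bytes(b"fake index")
    (sec / "ref_dir").mkdir()

    entries = discover_secondary_files(str(extra))

    classes = {e["basename"]: e["class"] for e in entries}
    assert classes == {"reads.bam.bai": "File", "ref_dir": "Directory"}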
Unresolved Questions
CWL parameters as DataParameterModel? Does runtimeify's visit_input_values recognize CWL file parameters? If CWL uses CwlInputParameter instead of DataParameterModel, the visitor won't call adapt_dataset. Need to trace actual parameter model types for a CWL tool.
Directory inputs? CWL directories are stored as tar archives in Galaxy (ext directory). Legacy code in representation.py:198+ untars into _inputs dir. Where does this happen in the new path? adapt_dataset produces class: "File" but directories need class: "Directory" with listing. Possibly needs a separate adapt_directory or a type check in adapt_dataset.
Collection inputs? CWL arrays/records map to Galaxy collections. runtimeify has adapt_collection support but raises NotImplementedError for some cases in the base path. Do CWL tools ever receive collection inputs through the tool request API? If so, how are they modeled?
Format stripping still needed? ToolProxy.__init__ strips format from inputs_record_schema (parser.py:143-148) so cwltool won't reject inputs missing format. If we now provide format URIs in File objects, does cwltool validate them against the schema format? If so, the stripping may cause a mismatch (schema says no format, input provides one). Needs testing.
compute_environment for secondary file paths? discover_secondary_files uses compute_environment.input_path_rewrite() for path resolution. Does input_path_rewrite work on arbitrary paths (like extra_files_path subdirectories), or only on HDA file names? If the latter, secondary file paths may need a different rewriting strategy.
Expression tools? ExpressionTools have no command line -- they run JS during output collection. The current path returns ["true"] as command. Does runtimeify affect ExpressionTool execution at all? The input_json still needs to flow to JobProxy.save_job() for output collection to work.
Symlink staging for secondary files? Legacy dataset_wrapper_to_file_json symlinks primary + secondary files into a shared _inputs dir so basename-relative references work. With runtimeify, secondary file paths point to extra_files_path/__secondary_files__/. Will cwltool's PathMapper.stage_files() handle staging them correctly, or do we need explicit symlinking?
input_dataset_collections wiring? The runtimeify call in set_compute_environment needs input_dataset_collections. job.io_dicts() returns (inp_data, out_data, out_collections) at line 185 but not input collections directly. Need to verify how job.input_dataset_collections maps to the dict format setup_for_runtimeify expects.
This section covers how CWL workflow execution interacts with the tool runtime code, what legacy code the workflow runner depends on, and what changes are needed to make the validated runtime plan work for workflows.
evaluate_value_from_expressions() (line 2426) converts all step state to CWL format via to_cwl() (modules.py:152-226)
Evaluates CWL JavaScript expressions via do_eval() against the CWL-format state
from_cwl() (modules.py:229-242) converts results back to Galaxy objects
Results applied via expression_callback which handles FieldTypeToolParameter specially (line 2647)
Phase 4: Job submission (lines 2673-2699)
Creates MappingParameters(tool_state.inputs, param_combinations) -- only two args, no validated_param_combinations
Calls execute() from tools/execute.py
This goes through handle_single_execution() -> _execute() which creates the Job
Step outputs (HDAs/HDCAs) flow back to progress.set_step_outputs()
Data Flow Between Steps
Between workflow steps, data flows as Galaxy model objects (HDAs and HDCAs), NOT as CWL File objects:
Step A outputs: {output_name: HDA}
| stored in progress.outputs[step_id]
|
Step B execution:
| callback asks progress.replacement_for_input()
| which calls progress.replacement_for_connection() -- run.py:615
| which returns the HDA from progress.outputs[step_id]
|
| If FieldTypeToolParameter: wraps as {"src": "hda", "value": HDA}
| If DataToolParameter: passes HDA directly
v
execution_state.inputs = {input_name: HDA or wrapped_value}
The to_cwl() function (modules.py:152-226) converts Galaxy objects to CWL-format dicts only for valueFrom/when expression evaluation. It is NOT used for the main tool execution path. raw_to_galaxy() is called from:
from_cwl() (modules.py:236) -- for File results from valueFrom expressions
set_outputs_for_input() (run.py:762-764) -- for workflow input steps that receive CWL File dicts
replacement_for_input() (run.py:590,612) -- for step inputs with value_from defaults
compute_runtime_state() callback (modules.py:488) -- for step inputs with default values
expression_callback in ToolModule.execute() (modules.py:2654-2656) -- for CWL File defaults not handled elsewhere
FieldTypeToolParameter.from_json() (basic.py:2924-2925) -- when deserializing CWL File objects in parameter values
CWL Scatter
CWL scatter is mapped to Galaxy's implicit collection mapping at import time:
At import: InputProxy.to_dict() sets scatter_type on step input connections (parser.py:1007-1010). Scatter inputs get "dotproduct" (or the explicit scatterMethod). Non-scatter inputs get "disabled".
At runtime: _find_collections_to_match() (modules.py:591-640) reads step_input.scatter_type and builds CollectionsToMatch. When scatter_type == "disabled", the input is skipped for collection matching (line 602-606). Collection inputs with scatter_type == "dotproduct" are matched using Galaxy's standard collection mapping infrastructure.
The mapping: CWL scatter over an array input corresponds to Galaxy iterating over a list collection. The workflow runner creates implicit collections (mapped-over outputs) via execution_tracker.implicit_collections.
Limitation: Only dotproduct and disabled are asserted (line 602). flat_crossproduct is defined in the parser (parser.py:1008) but not handled by the workflow runner.
valueFrom Expressions
CWL valueFrom JavaScript expressions are evaluated at workflow scheduling time, NOT at tool execution time.
Where defined: InputProxy.to_dict() stores value_from on step input connections (parser.py:1011-1013). These are persisted on WorkflowStepInput.value_from in the database.
Where evaluated: ToolModule.evaluate_value_from_expressions() (modules.py:2426-2458) and the top-level evaluate_value_from_expressions() (modules.py:245-298) for when expressions.
How evaluated:
All step state (both execution_state inputs and extra_step_state) converted to CWL format via to_cwl() (lines 2443-2447)
Each value_from expression evaluated by do_eval(value_from, step_state, context=step_state[key]) (lines 2450-2453)
Results converted back via from_cwl() (line 2455)
Applied to execution_state.inputs via expression_callback (lines 2642-2662)
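As a hedged illustration of the shape of that evaluation (values invented, and assuming do_eval follows standard CWL expression semantics where inputs is the step state and self is the context):

# Illustrative only -- not the real call site.
step_state = {
    "reads": {"class": "File", "location": "/tmp/reads.fastq", "basename": "reads.fastq"},
    "threads": 4,
}
value_from = "$(inputs.threads * 2)"
# do_eval(value_from, step_state, context=step_state["threads"]) would be
# expected to evaluate with inputs=step_state and self=4, yielding 8.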
Dependency on to_cwl(): The expression evaluation depends on to_cwl() to convert Galaxy objects (HDAs, collections) to CWL-compatible dicts that JavaScript expressions can operate on. This is a modules.py-local function, NOT part of representation.py.
Dependency on from_cwl(): Results are converted back via from_cwl(), which depends on raw_to_galaxy() for File results. This creates deferred HDAs.
Workflow Dependencies on Tool Runtime Code
What the Workflow Runner USES
Component | Location | Used By Workflow Runner? | Purpose
to_cwl() | modules.py:152 | YES | Convert Galaxy objects to CWL for valueFrom/when expressions
from_cwl() | modules.py:229 | YES | Convert CWL expression results back to Galaxy objects
raw_to_galaxy() | basic.py:2818 | YES | Create deferred HDAs from CWL File dicts
FieldTypeToolParameter | basic.py:2907 | YES | CWL catch-all parameter type, used for wrapping values in visit_input_values callback
do_eval() | expressions/evaluation.py:21 | YES | Evaluate CWL JavaScript expressions
set_basename_and_derived_properties() | cwl/util.py:44 | YES (via to_cwl) | Set basename/nameroot/nameext on File objects
to_galaxy_parameters() | representation.py:491 | INDIRECT | Called by CwlTool.inputs_from_dict() for inputs_representation="cwl" API submissions
FieldTypeToolParameter.from_json() | basic.py:2915 | YES | When workflow populates step state from API
What the Workflow Runner DOES NOT USE
Component | Location | Why Not Used
to_cwl_job() | representation.py:386 | Only called from CwlTool.param_dict_to_cwl_inputs(), NOT from workflow code
galactic_flavored_to_cwl_job() | representation.py:286 | Only called from GalacticCwlTool.param_dict_to_cwl_inputs()
dataset_wrapper_to_file_json() | representation.py:155 | Only called within to_cwl_job() and galactic_flavored_to_cwl_job()
galactic_job_json() | cwl/util.py:153 | Client-side staging only (test framework, planemo)
output_to_cwl_json() | cwl/util.py:513 | Test infrastructure only (populators.py)
JobProxy | parser.py:329 | Used by exec_before_job, not by workflow runner
runtime_actions.py | cwl/runtime_actions.py | Post-execution output collection script, not workflow runner
The Key Distinction
The workflow runner operates at a different layer than tool execution:
Workflow runner converts between Galaxy model objects (HDAs, HDCAs, DatasetCollections) and CWL-format dicts for expression evaluation. It uses to_cwl() (modules.py) and from_cwl() (modules.py) -- these are workflow-specific functions, NOT representation.py functions.
Tool execution converts between Galaxy parameter dicts (DatasetWrappers, wrapped values) and CWL job JSON (File objects with paths). It uses to_cwl_job() / galactic_flavored_to_cwl_job() (representation.py) -- this is the legacy code the plan eliminates.
These are separate code paths that both happen to produce CWL-format dicts.
What Legacy Code the Workflow Runner Needs
Code That Blocks Cleanup
1. FieldTypeToolParameter (basic.py:2907-2970)
Used by workflow runner at:
modules.py:2567 -- isinstance(input, FieldTypeToolParameter) to classify inputs as "data"
modules.py:2591-2597 -- wrapping values for FieldTypeToolParameter with {"src": "hda/hdca/json", "value": ...}
modules.py:2647 -- expression_callback wraps expression results for FieldTypeToolParameter
Why it blocks cleanup: CWL tool inputs are parsed as FieldTypeToolParameter instances via the CWL tool parser. The workflow step execution callback in ToolModule.execute() uses isinstance(input, FieldTypeToolParameter) to determine how to pass values. If we remove FieldTypeToolParameter, the workflow runner needs a different way to identify CWL-typed inputs.
Assessment: FieldTypeToolParameter CAN be removed from the tool execution path (the validated runtime plan bypasses it), but the workflow runner's visit_input_values callbacks still need to handle CWL inputs specially. With has_galaxy_inputs=False, the tool will have no Galaxy-style inputs, so tool.inputs will be empty (or different), and visit_input_values won't find any parameters to visit. This changes how the workflow runner wires inputs -- see Impact section.
2. raw_to_galaxy() (basic.py:2818-2904)
Used by workflow runner at: 7 locations (see list above).
Why it blocks cleanup: Creates deferred HDAs from CWL File dict objects. This is needed when:
valueFrom expressions produce File objects that need to become HDAs
Step inputs have CWL File defaults
Workflow inputs are specified as CWL File dicts
Assessment: This function is NOT part of representation.py and is NOT part of the legacy parameter hack. It's a general utility for creating HDAs from CWL-format dicts. It should survive cleanup. However, it lives in basic.py alongside FieldTypeToolParameter -- if basic.py gets major CWL surgery, raw_to_galaxy() needs to be preserved (perhaps moved to a more appropriate module).
3. to_cwl() / from_cwl() (modules.py:152-242)
These are workflow-specific functions in modules.py, NOT in representation.py. They convert Galaxy model objects (HDAs, DatasetCollections) to/from CWL-format dicts for JavaScript expression evaluation.
Why they exist: CWL valueFrom and when expressions operate on CWL-format data. The workflow runner must convert Galaxy objects to CWL format before evaluating expressions, and convert results back.
Assessment: These are independent of the legacy parameter hack and will continue to be needed regardless of the validated runtime plan.
4. to_galaxy_parameters() (representation.py:491)
Called from: CwlTool.inputs_from_dict() (tools/__init__.py:3877) when inputs_representation == "cwl".
Used by workflow runner: Not directly. But inputs_from_dict() is called from ToolsService (webapps/galaxy/services/tools.py:327) when a tool is invoked via the old /api/tools endpoint with CWL-format inputs. This is not the workflow path.
Assessment: With the tool request API, to_galaxy_parameters() becomes dead code. The workflow runner never calls it.
5. Conditional / Repeat parameter types for CWL
The legacy representation layer creates Galaxy Conditional and Repeat inputs to model CWL union types and arrays. These are referenced in:
representation.py:to_cwl_job() -- walks conditionals and repeats to reverse-engineer CWL types
representation.py:to_galaxy_parameters() -- creates conditional/repeat request format
Assessment: These are used ONLY in the tool execution path (to_cwl_job) and the API input conversion path (to_galaxy_parameters). The workflow runner's visit_input_values callback encounters them in tool.inputs, but after the validated runtime plan, CWL tools won't have these Galaxy-style inputs at all.
Impact of the Validated Runtime Plan on Workflows
The Critical Bug: validated_tool_state Is None for Workflow Steps
This is the most important finding.
The current WIP code at tools/__init__.py:3765:
input_json = validated_tool_state.input_state
This will crash (AttributeError: 'NoneType') when a CWL tool runs within a workflow because:
ToolModule.execute() creates MappingParameters(tool_state.inputs, param_combinations) with only two args (modules.py:2674)
validated_param_combinations defaults to None
In execute_single_job(), execution_slice.validated_param_combination is None
In _execute(), job.tool_state is NOT set because the validated state is None (execute.py:254-256)
In ToolEvaluator.set_compute_environment(), job.tool_state is None, so internal_tool_state is None (evaluation.py:217-220)
None is passed to exec_before_job(validated_tool_state=None) (evaluation.py:239)
Line 3765 crashes: None.input_state
The legacy fallback that would have caught this has been removed in the WIP commits, replaced by the direct validated_tool_state.input_state access.
What Needs to Change for Workflows
There are two approaches:
Approach A: Make the Workflow Runner Set validated_param_combinations
The workflow runner would need to:
Build a JobInternalToolState for each CWL tool step (converting the Galaxy object-rich execution_state.inputs into a serializable dict with {src: "hda", id: N} references)
Pass these as validated_param_combinations in MappingParameters
This is harder than it sounds because execution_state.inputs for CWL tools contains wrapped values from FieldTypeToolParameter ({"src": "hda", "value": <HDA>}, {"src": "json", "value": 42}), not the simple {src: "hda", id: N} format that JobInternalToolState expects.
Implications: The workflow step execution callback (ToolModule.execute()) would need restructuring for CWL tools. Instead of visiting tool.inputs (which uses the legacy Galaxy parameter tree with FieldTypeToolParameter/Conditional/Repeat), it would need to build a CWL-format input dict directly from the step connections.
Approach B: Keep the Legacy Fallback (Dual Path) in exec_before_job
exec_before_job keeps a fallback to the legacy param_dict_to_cwl_inputs() path whenever validated_tool_state is None, i.e., for workflow-invoked CWL tools.
This preserves the legacy code for workflow execution while allowing the tool request API to use the clean path.
Implications: param_dict_to_cwl_inputs(), to_cwl_job(), and galactic_flavored_to_cwl_job() must be retained. FieldTypeToolParameter, Conditional, Repeat CWL-specific parameter types must be retained. The legacy representation.py code stays alive.
Approach C: Bypass the Legacy Parameter System for CWL Workflow Steps
The most ambitious approach: change ToolModule.execute() to handle CWL tools differently. Instead of visit_input_values(tool.inputs, execution_state.inputs, callback), build the CWL input dict directly from step connections:
if tool.tool_type in CWL_TOOL_TYPES:
    # Build CWL input dict directly from connections
    cwl_input_dict = {}
    for input_name, connections in step.input_connections_by_name.items():
        replacement = progress.replacement_for_connection(connections[0])
        cwl_input_dict[input_name] = to_cwl_reference(replacement)  # {src: "hda", id: N}
    # Handle valueFrom expressions (still needs to_cwl/from_cwl for JS eval)
    ...
    internal_state = JobInternalToolState(cwl_input_dict)
    # Pass to execute with validated_param_combinations
This would allow removing FieldTypeToolParameter entirely and all the conditional/repeat CWL hacks.
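The snippet above assumes a small helper, here called to_cwl_reference(). A possible sketch (the Galaxy model class names are real; the helper itself is hypothetical):

from galaxy.model import HistoryDatasetAssociation, HistoryDatasetCollectionAssociation


def to_cwl_reference(replacement):
    """Hypothetical helper: turn a step-connection replacement into a serializable ref."""
    if isinstance(replacement, HistoryDatasetAssociation):
        return {"src": "hda", "id": replacement.id}
    if isinstance(replacement, HistoryDatasetCollectionAssociation):
        return {"src": "hdca", "id": replacement.id}
    return replacement  # scalar parameter values pass through unchanged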
How Runtimeify Affects Workflow Step Execution
The runtimeify plan (converting {src: "hda", id: N} references to CWL File objects with paths) works the same for workflow-invoked CWL tools as for directly-invoked ones -- if we can get a JobInternalToolState set on the job. The runtimeify step happens in ToolEvaluator.set_compute_environment() (evaluation.py:200-222), which runs for all jobs regardless of how they were submitted.
The key question is: who builds the JobInternalToolState for workflow-invoked CWL tools?
What About to_cwl() / from_cwl() in modules.py?
These functions are used for valueFrom/when expression evaluation and are independent of the tool execution path. They:
Convert Galaxy model objects to CWL-format dicts for JavaScript
Convert JavaScript results back to Galaxy model objects
The validated runtime plan does NOT affect these. Even with Approach C, valueFrom expressions still need to_cwl() to convert HDAs to CWL File objects for JavaScript evaluation. However, to_cwl() in modules.py is much simpler than to_cwl_job() in representation.py:
It works with Galaxy model objects (HDAs), not DatasetWrappers
It doesn't handle conditionals, repeats, or the _cwl__type_/_cwl__value_ encoding
It doesn't need secondary files, checksums, or full CWL compliance
Impact on CWL Scatter
CWL scatter maps to Galaxy's collection-based parallelism. The scatter handling happens at the workflow runner level (_find_collections_to_match() in modules.py:591-640) and in Galaxy's standard collection mapping infrastructure. This is independent of the tool runtime code.
The validated runtime plan does NOT affect scatter. The scatter produces multiple execution slices (param_combinations), each of which becomes a separate job. Each job goes through the same exec_before_job path.
However, if Approach C is taken (bypassing legacy parameter system for CWL workflow steps), scatter handling would need to work with the new input dict format instead of the legacy execution_state.inputs.
Unresolved Questions
Which approach for workflow support? A (make workflow runner set validated state), B (dual path in exec_before_job), or C (bypass legacy param system for CWL workflow steps)? B is safest/fastest but prevents cleanup. C is cleanest but most work.
Can we ship tool-only support first? The tool request API path works for direct CWL tool invocation. Can workflows remain on the legacy path (Approach B) while we iterate toward Approach A or C?
How do FieldTypeToolParameter instances get created for CWL tools with has_galaxy_inputs=False? If CWL tools no longer have Galaxy-style inputs (because the new commits stripped InputInstance to just name/label/description), does tool.inputs contain FieldTypeToolParameter instances at all? If not, the visit_input_values callback in ToolModule.execute() won't match any inputs, and the wiring will be broken for CWL workflow steps.
What parameter model does CwlToolSource produce now? The parse_input_pages() changes in the WIP commits affect what tool.inputs looks like for CWL tools. If tool.inputs is empty, the entire visit_input_values loop in ToolModule.execute() becomes a no-op, and inputs won't be wired correctly for workflow execution.
Does to_galaxy_parameters() need to survive? It's called from inputs_from_dict() which handles inputs_representation="cwl" on the old /api/tools endpoint. If we drop legacy API support, this can go. But do workflows use it? (Answer: no, the workflow runner builds state differently.)
How do workflow step defaults interact with the new path? CWL step input defaults (parser.py:1014-1015) are stored on WorkflowStepInput.default_value. These can be CWL File dicts. The workflow runner calls raw_to_galaxy() to convert them to HDAs. This is independent of the tool runtime, but raw_to_galaxy lives in basic.py near the code we might want to clean up.
Can raw_to_galaxy() move out of basic.py? It's a general utility for creating deferred HDAs from CWL dicts. It doesn't depend on FieldTypeToolParameter. Moving it to a CWL-specific module (e.g., tools/cwl_runtime.py) would decouple it from basic.py cleanup.
How do expression.json outputs flow between CWL workflow steps? CWL ExpressionTools produce expression.json datasets. Downstream steps that receive these check hda.ext == "expression.json" and read the JSON content instead of using the HDA as a file (modules.py:2531-2548, modules.py:2571-2589). This is handled in the workflow runner, not in tool runtime code. Does it still work with the new parameter model?
What about flat_crossproduct scatter? The parser generates it (parser.py:1008) but the workflow runner only asserts dotproduct or disabled (modules.py:602). Is this a pre-existing gap or something that needs addressing for the runtime plan?
Test coverage for CWL workflows? test_workflows_cwl.py has tests for simple workflows, multi-step, scatter, and collection outputs. These all go through the standard invoke_workflow_and_assert_ok path which uses the workflow runner, NOT the tool request API. These tests will catch workflow regressions.
CWL Validated Runtime: Workflow Adaption Plan (Approach C)
Bypass the legacy Galaxy parameter system for CWL workflow steps entirely. Build CWL input dicts directly from step connections, set validated_param_combinations on MappingParameters, and converge tool-invoked and workflow-invoked CWL tools onto the same validated runtime path.
Branch: cwl_on_tool_request_api_2
What to Do During CWL_VALIDATED_RUNTIME_PLAN Implementation to Set This Up
These actions during the tool-only runtime plan make the workflow adaption easier and more likely to succeed. None break tool-only functionality.
1. Keep the legacy fallback (dual path) in exec_before_job
This ensures existing CWL workflow tests keep passing while you iterate on tool-only runtimeify. The fallback path fires for workflow-invoked CWL tools (where validated_tool_state is None). Remove it only after Approach C lands.
2. Ensure CwlFileParameterModel is recognized by runtimeify
The runtimeify visitor checks isinstance(parameter, DataParameterModel). CWL file parameters use CwlFileParameterModel which extends BaseGalaxyToolParameterModelDefinition, not DataParameterModel. The visitor won't recognize them without a fix.
During tool-only plan: Either extend the visitor's to_runtime_callback to also check for CwlFileParameterModel / CwlDirectoryParameterModel, or make those types inherit from (or register as) DataParameterModel. This is needed for tool-only runtimeify too, so it's not premature work.
3. Design the CWL input dict format to be workflow-friendly
When building the JobInternalToolState for tool-only invocations, ensure the dict format is the same one the workflow adaption will later build from step connections. Don't embed Galaxy objects or wrapper dicts. Keep it serializable JSON with dataset refs as {src, id}.
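For concreteness, an illustrative example of that shape (all IDs and names invented):

cwl_input_dict = {
    "reads": {"src": "hda", "id": 42},      # dataset input
    "reference": {"src": "hda", "id": 43},
    "samples": {"src": "hdca", "id": 7},    # collection input (e.g., scatter source)
    "min_quality": 20,                      # scalar parameter
    "optional_bed": None,                   # unset optional input
}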
4. Validate runtimeify works without collection inputs initially
The workflow adaption needs collection inputs eventually (for scatter), but the tool-only plan can punt on adapt_collection for CWL. Just ensure it doesn't crash -- raise NotImplementedError with a clear message. Scatter handling is Step 5 of this plan.
5. Move raw_to_galaxy() out of basic.py
raw_to_galaxy() creates deferred HDAs from CWL File dicts. The workflow runner needs it for valueFrom expression results. Move it to tools/cwl_runtime.py (the new file from the tool-only plan) during that work so basic.py cleanup is unblocked later.
6. Add workflow CWL tests to CI monitoring
Run test_workflows_cwl.py in CI during the tool-only work. These tests use the legacy workflow path and should keep passing through the dual-path in exec_before_job. If they break, you'll know something in the tool-only changes affected the legacy path.
Prerequisites
CWL_VALIDATED_RUNTIME_PLAN mostly working: tool-only CWL execution via tool request API using runtimeify, CwlDataInternalJson with secondary files, validated_tool_state flowing through exec_before_job
exec_before_job dual-path in place (fallback to legacy for workflows)
Existing CWL workflow tests passing via the legacy fallback
Step 1: Build CWL Input Dict from Step Connections
Goal: Replace visit_input_values(tool.inputs, execution_state.inputs, callback) for CWL tools with direct dict construction from step connections.
Why: With has_galaxy_inputs=False, CWL tools have no Galaxy-style tool.inputs tree. The visit_input_values callback can't wire inputs. Even if tool.inputs were populated, the callback uses isinstance(input, FieldTypeToolParameter) which depends on the legacy parameter model we're eliminating.
1a: New function: build_cwl_input_dict()
Where: lib/galaxy/workflow/modules.py, new function
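A first sketch of the function, assuming the hypothetical to_cwl_reference() helper from the Approach C snippet and ignoring merged connections, expression.json extraction (discussed next), and user-supplied step parameters:

def build_cwl_input_dict(step, progress, trans):
    """Sketch: build a serializable CWL input dict from step connections."""
    cwl_input_dict = {}
    for input_name, connections in step.input_connections_by_name.items():
        # Single-connection case only; merged inputs need
        # replacement_for_input_connections()-style logic.
        replacement = progress.replacement_for_connection(connections[0])
        cwl_input_dict[input_name] = to_cwl_reference(replacement)
    # Apply step defaults for unconnected inputs.
    for step_input in step.inputs:
        if step_input.name not in cwl_input_dict and step_input.default_value is not None:
            cwl_input_dict[step_input.name] = step_input.default_value
    return cwl_input_dict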
expression.json handling: CWL ExpressionTools produce expression.json datasets. When a downstream CWL step receives one as a non-data input (scalar parameter), the JSON content should be extracted and passed as the scalar value. This matches the existing behavior in to_cwl() at modules.py:181-183.
Contingency: If expression.json detection causes issues (e.g., a CWL tool genuinely wants the file, not its contents), we may need to check the CWL input type from the tool's parameter model to decide. The CWL parameter model knows whether an input is File type vs scalar.
1c: Wire into ToolModule.execute()
Where: lib/galaxy/workflow/modules.py, ToolModule.execute() around line 2476
# In ToolModule.execute(), before the visit_input_values callback:
if tool.tool_type in CWL_TOOL_TYPES and not tool.has_galaxy_inputs:
    # Approach C: bypass legacy parameter system for CWL
    cwl_input_dict = build_cwl_input_dict(step, progress, trans)
    # ... handle valueFrom expressions (Step 2)
    # ... handle scatter (Step 5)
    # ... build JobInternalToolState and MappingParameters (Step 3)
else:
    # Legacy path for Galaxy tools
    visit_input_values(tool_inputs, execution_state.inputs, callback, ...)
    # ... existing code
Key considerations
step.input_connections_by_name maps input names to lists of WorkflowStepConnection objects. Each connection points to an upstream step's output.
Multiple connections to the same input indicate merged inputs (collections). Need the same merge logic as replacement_for_input_connections().
Step defaults from WorkflowStepInput.default_value need to be applied for unconnected inputs.
Step 2: valueFrom Expression Evaluation with CWL Input Dict
Goal: Evaluate CWL valueFrom JavaScript expressions against the CWL input dict, using to_cwl()/from_cwl() for Galaxy object conversion.
2a: Adapt evaluate_value_from_expressions for CWL dict inputs
The existing evaluate_value_from_expressions() (modules.py:2426-2458) works with execution_state.inputs containing FieldTypeToolParameter-wrapped values. For Approach C, the input dict already has CWL-native references.
def evaluate_cwl_value_from_expressions(
    step: WorkflowStep,
    cwl_input_dict: dict,
    progress: WorkflowProgress,
    trans,
) -> dict[str, Any]:
    """Evaluate CWL valueFrom expressions against a CWL input dict.

    Modifies cwl_input_dict in place with expression results.
    """
    value_from_map = {}
    for step_input in step.inputs:
        if step_input.value_from:
            value_from_map[step_input.name] = step_input.value_from
    if not value_from_map:
        return cwl_input_dict

    # Convert input dict to CWL format for JS evaluation
    hda_references = []
    step_state = {}
    for key, value in cwl_input_dict.items():
        step_state[key] = _ref_to_cwl(value, hda_references, trans, step)

    # Evaluate each valueFrom expression
    for key, value_from in value_from_map.items():
        context = step_state.get(key)
        result = do_eval(value_from, step_state, context=context)
        # Convert CWL result back to input dict reference
        cwl_input_dict[key] = _cwl_result_to_ref(
            result, hda_references, progress, trans
        )
    return cwl_input_dict
2b: Helper: _ref_to_cwl()
Convert an input dict reference ({src: "hda", id: N} or scalar) to CWL format for JavaScript evaluation:
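A sketch under the assumption that refs are looked up via the SQLAlchemy session and converted with a simplified inline File dict; a real implementation would reuse to_cwl() from modules.py, and collection refs are deliberately left unimplemented here:

from galaxy.model import HistoryDatasetAssociation


def _ref_to_cwl(value, hda_references, trans, step):
    """Sketch: resolve a {src, id} reference into a CWL-format value for JS evaluation."""
    if isinstance(value, dict) and value.get("src") == "hda":
        hda = trans.sa_session.get(HistoryDatasetAssociation, value["id"])
        hda_references.append(hda)
        return {
            "class": "File",
            "location": hda.get_file_name(),
            "basename": hda.name,
            "size": hda.get_size(),
        }
    if isinstance(value, dict) and value.get("src") == "hdca":
        # Collections map to CWL arrays; deferred until the scatter step.
        raise NotImplementedError("collection refs not handled in this sketch")
    return value  # scalars pass through unchanged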
Note: from_cwl() may call raw_to_galaxy() to create deferred HDAs from expression results that produce CWL File objects. This is correct -- the deferred HDA gets an ID, and we store {src: "hda", id: N} in the input dict.
2d: when_expression evaluation
when expressions (step conditional execution) also need adaptation. These are evaluated separately (modules.py:245-298) and return boolean. The existing code converts execution_state.inputs via to_cwl(). For CWL tools, use the same _ref_to_cwl() conversion on the CWL input dict.
Contingency: If the when_expression evaluation is too entangled with the existing code flow, it can be handled with a simple conditional: convert the CWL input dict refs to Galaxy objects for the existing to_cwl() path. This is less clean but works.
Step 3: Build JobInternalToolState and Set validated_param_combinations
Goal: Construct a JobInternalToolState from the CWL input dict and pass it through MappingParameters so job.tool_state gets set.
3a: Create JobInternalToolState
# After building cwl_input_dict and evaluating valueFrom:
internal_tool_state = JobInternalToolState(cwl_input_dict)
internal_tool_state.validate(tool, f"{tool.id} (workflow step)")
Potential issue: validate() calls create_job_internal_model() which uses the tool's parameter model. For CWL tools, this model uses CwlFileParameterModel etc. The validation needs to accept {src: "hda", id: N} dicts for file parameters. This should work if CwlFileParameterModel.py_type returns DataRequest and the pydantic model accepts this format. Verify during implementation.
3b: Build MappingParameters with validated state
# For non-scatter case (single param_combination):
param_combinations = [cwl_input_dict]
validated_param_combinations = [internal_tool_state]
mapping_params = MappingParameters(
    param_template=cwl_input_dict,
    param_combinations=param_combinations,
    validated_param_template=None,  # Not needed for workflow path
    validated_param_combinations=validated_param_combinations,
)
Why this works: execute_single_job() at execute.py:254-256 checks execution_slice.validated_param_combination and if present, sets job.tool_state = validated_param_combination.input_state. This is the same path the tool request API uses.
Once job.tool_state is set, ToolEvaluator.set_compute_environment() reconstructs the JobInternalToolState, runtimeify converts dataset refs to File objects, and exec_before_job receives the runtimeified state -- the same path as tool-only invocation.
Step 4: Handle Input/Output Dataset Associations
Goal: Ensure Galaxy creates the right input/output dataset associations for the job.
4a: Input dataset registration
The workflow runner's ToolModule.execute() currently passes execution_state.inputs (with Galaxy objects) to tool.handle_single_execution(), which creates JobToInputDatasetAssociation entries. With Approach C, the input dict has {src: "hda", id: N} references, not Galaxy objects.
Two options:
Convert refs back to objects for handle_single_execution: Before calling execute(), build the legacy params dict with actual HDA/HDCA objects that _execute() expects for input dataset association.
Let the tool request API path handle it: If the tool request API path already creates input associations from {src, id} references (via the validated state), the workflow path should match. Verify this.
Contingency: If input dataset association is broken, add a post-processing step that reads job.tool_state and creates JobToInputDatasetAssociation entries for each {src: "hda", id: N} reference.
4b: Output dataset creation
Output datasets are created by tool.handle_single_execution() → DefaultToolAction.execute(). This creates output HDAs based on tool.outputs. This is independent of the input format and should work unchanged.
Step 5: Scatter / Collection Mapping
Goal: Make CWL scatter work with the new CWL input dict format.
CWL scatter maps to Galaxy's implicit collection mapping. The workflow runner uses _find_collections_to_match() to identify which inputs have collections that should be iterated over.
5a: Identify scatter inputs
For CWL tools, scatter inputs are marked via step_input.scatter_type (set during workflow import from CWL scatter declarations).
When scatter produces multiple execution slices, each slice has different input values (individual elements from the scattered collection). Build param_combinations and validated_param_combinations as lists -- one entry per slice.
Step 6: Remove Legacy CWL Parameter Code
Once workflow execution runs on the new path and tests pass, the following become candidates for removal:
galactic_flavored_to_cwl_job() in representation.py
dataset_wrapper_to_file_json() in representation.py
TYPE_REPRESENTATIONS dict in representation.py
FieldTypeToolParameter in basic.py (if no other callers)
CWL-specific Conditional/Repeat handling in representation.py
6c: FieldTypeToolParameter removal
Check for remaining callers:
modules.py:2567 -- isinstance(input, FieldTypeToolParameter) in the visit_input_values callback. With Approach C, CWL tools take the new branch; this isinstance check only fires for legacy Galaxy tools (which never use FieldTypeToolParameter). Safe to remove.
modules.py:2647 -- expression_callback. Same situation.
basic.py:2924 -- from_json(). Only called when deserializing tool state with FieldTypeToolParameter inputs. Dead with new path.
Preserve: raw_to_galaxy() (now in cwl_runtime.py per setup step 5). Still needed by from_cwl() for valueFrom expression results.
Step 7: Clean Up basic.py
Goal: Remove all CWL-specific code from basic.py.
Remove FieldTypeToolParameter class (basic.py:2907-2970)
Remove raw_to_galaxy() (already moved to cwl_runtime.py)
Remove CWL imports and TYPE_REPRESENTATIONS references
Result: basic.py has zero CWL-specific code. The CWL branch modifies basic.py not at all.
Testing Strategy
Unit tests (new)
test_build_cwl_input_dict(): Mock step connections and progress outputs. Verify dict has correct {src, id} refs for HDAs, scalar values for parameters, parsed JSON for expression.json HDAs.
test_evaluate_cwl_value_from_expressions(): Given a CWL input dict with HDA refs and a valueFrom expression, verify the expression sees CWL File objects and results are converted back to refs.
test_cwl_scatter_expansion(): Given a CWL input dict with HDCA ref and scatter_type=dotproduct, verify param_combinations has one entry per collection element.
valueFrom expressions in workflow steps: No existing test. Add one using a CWL workflow with valueFrom on a step input.
Secondary files between workflow steps: A tool produces output with secondaryFiles, next step consumes it. Verify secondary files survive the runtimeify path.
when expressions: No existing test for CWL when. Add if CWL branch supports it.
Steps 1-3 form the minimum viable workflow support. Run test_simplest_wf after Step 3.
Step 2 can be deferred if no existing workflow tests use valueFrom. Check by running tests after Step 3 with a stub that skips valueFrom evaluation.
Step 5 (scatter) is needed for test_scatter_wf1_v1 and test_count_lines3_v1.
Steps 6-7 are cleanup after all tests pass.
Contingencies
If CwlFileParameterModel validation fails
The JobInternalToolState.validate() call may fail if the CWL pydantic parameter model doesn't accept {src: "hda", id: N} dicts for file inputs.
Fix: Check what CwlFileParameterModel.py_type returns. It's DataRequest. The DataRequest type should accept {src, id} format. If not, adjust the pydantic template method to accept this format in the job_internal state representation.
Fallback: Skip validation temporarily (internal_tool_state = JobInternalToolState(cwl_input_dict) without .validate()) and fix the model later.
If expression.json detection is wrong
Some CWL tools might accept expression.json as a File input (not wanting the parsed content). The current _galaxy_to_cwl_ref() always parses expression.json HDAs.
Fix: Consult the CWL tool's parameter model to determine if the input expects File type. If so, return {src: "hda", id: N} even for expression.json datasets. Only parse the JSON for scalar-typed inputs.
If input dataset association breaks
The _execute() function in execute.py creates JobToInputDatasetAssociation entries by walking the params dict. If the CWL input dict format differs from what _execute() expects:
Fix: Add CWL-specific input association creation in execute_single_job() or a callback. Iterate over validated_param_combination.input_state, find {src: "hda", id: N} refs, and create associations.
If scatter with nested collections fails
CWL dotproduct scatter over a list collection is straightforward. Nested collections or multi-variable scatter may hit edge cases in CollectionsToMatch.
Fix: Implement nested scatter incrementally. Start with single-variable flat list scatter. The existing test infrastructure only tests this case anyway.
If to_cwl/from_cwl changes are needed for new format
The existing to_cwl() function works with Galaxy model objects. The new path stores {src, id} refs in the dict, so _ref_to_cwl() needs to look up objects by ID. This requires a database session.
If session access is awkward: Pre-resolve all {src, id} refs to Galaxy objects before calling expression evaluation. Build a lookup dict once, reuse it.
Unresolved Questions
Does CwlFileParameterModel.pydantic_template() return a model that accepts {src: "hda", id: N} for job_internal state representation? If it just returns DataRequest without state-specific templates, validation may fail.
How does _execute() in execute.py create JobToInputDatasetAssociation? Does it walk params looking for HDA objects, or does it use validated_param_combination? If the former, the CWL input dict with {src, id} refs won't have HDA objects to find.
Do any CWL workflow tests exercise step.state.inputs (user-supplied step parameters at invocation time)? If so, Step 1's build_cwl_input_dict() needs to merge those into the dict.
How do CWL subworkflow steps work in ToolModule.execute()? Subworkflows use a different module type (SubWorkflowModule), not ToolModule. Approach C only changes ToolModule.execute(). Subworkflows should be unaffected, but verify.
What's the interaction between MappingParameters.validated_param_template and the workflow path? Tool request API sets both template and combinations. Workflow path may only need combinations. Does execute() require the template to be non-None?
For collection inputs to CWL tools (non-scatter), how does the {src: "hdca", id: N} reference flow through runtimeify? The adapt_collection callback in setup_for_runtimeify() expects DataCollectionRequestInternal. Does the CWL runtimeify path handle this?
This section covers how Galaxy uses cwltool — every import, every API call, and whether those calls make sense for the migration.
Branch: cwl_on_tool_request_api_2
Dependency Overview
cwltool is the CWL reference runner. Galaxy uses it as a library, not a CLI — loading CWL documents, generating commands, staging files, and collecting outputs. Galaxy never runs cwltool as a subprocess.
All cwltool imports go through a single wrapper module (cwltool_deps.py) with try/except guards so cwltool remains an optional dependency. Only parser.py and schema.py call cwltool APIs directly. Everything else uses plain Python dicts that happen to match the CWL spec.
{"type": "record", "fields": [...]} — input type definitions
.outputs_record_schema
dict
Same structure for outputs
.schemaDefs
dict
Maps type names to resolved schema definitions
.requirements
list
CWL requirements (modified in-place by _hack_cwl_requirements)
.hints
list
CWL hints (DockerRequirement moved here)
Key operations on Process:
Format stripping (__init__, line 145-148): Removes "format" from input field definitions so cwltool won't complain about missing format in input data. This is a workaround — Galaxy doesn't track CWL format URIs on datasets.
Schema definition resolution (input_fields, line 293-294): Looks up input_type in schemaDefs to resolve named types (e.g., SchemaDefRequirement types).
Serialization (to_persistent_representation): Serializes .tool dict + .requirements + .metadata["cwlVersion"] to JSON for database storage.
DockerRequirement hack (_hack_cwl_requirements, line 893-901): Moves DockerRequirement from .requirements to .hints so cwltool doesn't try to run containers — Galaxy handles containerization independently.
Assessment: All attribute access is on well-established cwltool Process properties. The format stripping and Docker hack are legitimate bridging concerns. The serialization approach (persisting .tool dict) works because cwltool can reconstruct a Process from a raw CWL dict via fetch_document() + make_tool().
2.2 JobProxy — Wraps cwltool.job.Job (lazily)
This is the heaviest cwltool integration point.
_normalize_job() — Preparing inputs for cwltool (line 376-391):
RuntimeContext({}) — empty context just to get filesystem access factory
getdefault(runtime_context.make_fs_access, StdFsAccess) — get fs_access class with StdFsAccess as default
process.fill_in_defaults(inputs_list, input_dict, fs_access) — fills default values into _input_dict in-place
visit_class(obj, class_names, callback) — converts "path" keys to "location" in File/Directory objects
Assessment: Correct API usage. fill_in_defaults is cwltool's standard way to apply CWL default values. visit_class is the standard recursive visitor. The pathToLoc conversion is necessary because Galaxy internally uses path but cwltool expects location.
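A minimal stand-in for that conversion, mimicking what visit_class plus the pathToLoc callback do without importing cwltool; the simplified traversal below is an assumption, not the library code.

def path_to_loc(obj):
    # Galaxy tracks "path"; cwltool expects "location" on File/Directory objects.
    if "location" not in obj and "path" in obj:
        obj["location"] = obj.pop("path")


def visit_files_and_dirs(value, callback):
    """Simplified recursive visitor in the spirit of cwltool's visit_class."""
    if isinstance(value, dict):
        if value.get("class") in ("File", "Directory"):
            callback(value)
        for v in value.values():
            visit_files_and_dirs(v, callback)
    elif isinstance(value, list):
        for v in value:
            visit_files_and_dirs(v, callback)


input_dict = {"reads": {"class": "File", "path": "/data/reads.fastq"}}
visit_files_and_dirs(input_dict, path_to_loc)
# input_dict["reads"] now carries "location" instead of "path"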
_ensure_cwl_job_initialized() — Creating the cwltool Job (line 354-374):
use_container=False — Galaxy wraps the command in its own container layer later
select_resources=self._select_resources — callback that injects SENTINEL_GALAXY_SLOTS_VALUE (1.480231396) for cores since the real slot count isn't known at job-preparation time
Shallow copy of Process + deep copy of inputs_record_schema — because Process.job() mutates inputs_record_schema in place (not thread-safe)
next() on the generator — cwltool's job() returns a generator, Galaxy only needs the first (and only) job
Assessment: The copy pattern is a legitimate workaround for cwltool's in-place mutation. The GALAXY_SLOTS sentinel hack is fragile but necessary — cwltool needs a concrete number for ResourceRequirement evaluation at job-construction time. The use_container=False is correct — Galaxy's container system handles Docker/Singularity.
Job object properties accessed (on the object returned by Process.job()):
Property | Type | CommandLineTool | ExpressionTool
command_line | list[str] | Command fragments | N/A (checked via hasattr)
stdin | str or None | Stdin redirect path | N/A
stdout | str or None | Stdout redirect path | N/A
stderr | str or None | Stderr redirect path | N/A
environment | dict | EnvVarRequirement vars | N/A
generatefiles | dict | {"listing": [...]} for InitialWorkDirRequirement | N/A
pathmapper | PathMapper | Input file path mapping | N/A (checked via hasattr)
inplace_update | bool | InlineJavascriptRequirement flag | N/A
Assessment: These are all standard cwltool Job properties. Galaxy distinguishes CommandLineTool from ExpressionTool via hasattr(cwl_job, "command_line") which is the correct check — ExpressionTool jobs don't have a command_line attribute.
Job methods called:
Method | When | Signature
collect_outputs(outdir, rcode) | CommandLineTool post-execution | Returns dict[str, CWLObjectType] — output name to CWL value mapping
Assessment: Correct usage. For CommandLineTools, collect_outputs evaluates output glob patterns against the working directory. For ExpressionTools, run() executes the JavaScript and delivers results via callback. The empty RuntimeContext for expression execution is fine — expressions don't need runtime configuration.
cwltool calls the select_resources callback during job construction to resolve ResourceRequirement. Galaxy substitutes a sentinel float for cores, then replaces it back with $GALAXY_SLOTS in the command_line property.
Assessment: This is a hack but functional. The sentinel value (1.480231396) is unlikely to appear naturally. The real fix would be deferring job construction to the compute node where slot count is known, but that would require architectural changes.
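As a hedged illustration of the sentinel round-trip (the sentinel value is quoted from the text above; the callback shape and the string replacement are assumptions, not the actual implementation):

SENTINEL_GALAXY_SLOTS_VALUE = 1.480231396


def select_resources(request, runtime_context=None):
    # Report the sentinel for "cores" so cwltool can finish ResourceRequirement
    # evaluation at job-construction time, before the real slot count is known.
    resources = dict(request)
    resources["cores"] = SENTINEL_GALAXY_SLOTS_VALUE
    return resources


def unsentinel(args):
    # Later, swap the sentinel back for the $GALAXY_SLOTS shell variable.
    sentinel = str(SENTINEL_GALAXY_SLOTS_VALUE)
    return [a.replace(sentinel, "$GALAXY_SLOTS") if isinstance(a, str) else a for a in args]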
Galaxy's stageFunc creates symlinks (os.symlink). Two passes:
Input files — symlinked from Galaxy dataset paths to cwltool staging
Generated files (InitialWorkDirRequirement) — staged into working directory, then relinked
Assessment: This follows cwltool's own staging pattern. The separateDirs=False is correct for Galaxy's flat working directory. The ignore_writable=True for inputs prevents cwltool from trying to copy writable files. The symlink=False parameter to process.stage_files means Galaxy's custom stageFunc handles linking, not cwltool's default.
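A sketch of a symlink-based staging callback of that kind; the exact callback signature cwltool's stage_files expects is an assumption here.

import os


def stage_via_symlink(src_path: str, dst_path: str) -> None:
    # Symlink inputs into the staging location instead of copying them.
    os.makedirs(os.path.dirname(dst_path), exist_ok=True)
    if not os.path.exists(dst_path):
        os.symlink(src_path, dst_path)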
rewrite_inputs_for_staging() (line 393-431):
This method has a commented-out block for pathmapper-based rewriting, with an active fallback that manually symlinks files whose location doesn't match their basename. This is a workaround for files that cwltool's staging doesn't handle (e.g., expression tools without a pathmapper).
Assessment: The commented-out code suggests this is incomplete. The active fallback is functional but inelegant — it manually traverses the input dict looking for location/basename mismatches.
JobProxy.save_job() serializes the full CWL context to .cwl_job.json so the output collection script can reconstruct a JobProxy post-execution.
Assessment: This round-trip works because to_persistent_representation() captures the raw CWL tool dict, and from_persistent_representation() feeds it back through cwltool's loading pipeline (with strict_cwl_validation=False for speed). The non-strict loader avoids re-validating at output collection time.
Attribute | Type | Content
.tool | dict | Step definition — run, class, inputs, scatter, scatterMethod, when
.id | str | Step identifier
.requirements | list | Step-level requirements
.hints | list | Step-level hints
.embedded_tool | Process or Workflow | The nested Process or sub-Workflow for inline tools
Assessment: All access is on standard cwltool Workflow/WorkflowStep properties. Galaxy reads these to convert CWL workflows into Galaxy's internal workflow format. No mutation of cwltool objects here.
3. runtime_actions.py — Output Collection
Direct cwltool calls: Only ref_resolver.uri_file_path(location) for converting file:// URIs to filesystem paths.
Indirect cwltool calls: job_proxy.collect_outputs() which delegates to cwltool's Job.collect_outputs() or Job.run().
The rest is pure Galaxy logic — moving files, handling secondary files, writing galaxy.json metadata.
Assessment: Clean separation. The only cwltool dependency is through JobProxy (which is correct) and the URI resolver (which is a simple utility).
4. representation.py — Zero cwltool API Calls
Despite being the CWL↔Galaxy translation layer, this module never calls cwltool directly. It constructs plain Python dicts matching the CWL spec ({"class": "File", "location": ...}). The only cwltool-adjacent reference is _cwl_tool_proxy.input_fields() which is Galaxy's ToolProxy method.
Assessment: This is actually a good thing. The representation layer's problems are conceptual (the round-trip hack), not dependency-related. Eliminating it doesn't remove any cwltool API usage.
5. util.py — Zero cwltool API Calls
galactic_job_json(), output_to_cwl_json(), and the upload target classes all work with plain CWL-spec dicts. No cwltool dependency.
6. parser/cwl.py (CwlToolSource) — Indirect Only
All cwltool access goes through ToolProxy methods. The one exception is self.tool_proxy._tool.tool.get("successCodes") which reads the raw CWL dict for exit code parsing.
JobProxy's cwltool API usage stays the same either way. The calls to fill_in_defaults, Process.job(), stage_files, and collect_outputs are unchanged.
What Stays
All cwltool API calls in parser.py are correct for the migration. The proxy layer is well-designed — it isolates cwltool behind clean interfaces. The three areas that remain:
Document loading (schema.py) — Unchanged by migration
Job execution (JobProxy) — Input format changes, API calls stay the same
Output collection (runtime_actions.py via JobProxy) — Unchanged by migration
What the Runtimeify Step Must Produce
For _normalize_job() and subsequently Process.job() to work, the input dict must contain:
File inputs as {"class": "File", "path": "/absolute/path", ...} or {"class": "File", "location": "file:///path", ...} (the pathToLoc callback in _normalize_job converts path to location)
Directory inputs similarly
Scalar values as plain Python types
Optional null inputs omitted or set to None
fill_in_defaults will fill in any missing inputs that have CWL defaults
This is what the CWL-specific runtimeify must produce from JobInternalToolState (which has {src: "hda", id: N} references).
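An illustrative example of that shape (all paths, formats, and names invented):

cwl_job_inputs = {
    "reads": {
        "class": "File",
        "path": "/jobs_dir/000/42/inputs/reads.fastq",  # pathToLoc turns this into "location"
        "basename": "reads.fastq",
        "format": "http://edamontology.org/format_1930",  # EDAM URI for FASTQ, if known
        "secondaryFiles": [
            {"class": "File", "path": "/jobs_dir/000/42/inputs/reads.fastq.fai"}
        ],
    },
    "min_quality": 20,      # scalar as a plain Python type
    "optional_bed": None,   # optional input left unset; fill_in_defaults may supply a CWL default
}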
Potential Simplification
normalizeFilesDirs is imported but unused — could be cleaned up. The command_line_tool import is also dead. Neither affects functionality.