On a macOS arm64 (M1) machine, running the NumPy test suite:
Default build:
- `pixi r test -- --collect-only`: 10 s.
- `pixi r test` (1 thread): 163 s.
- `pixi r test -j2` (pytest-xdist): 116 s.
Free-threaded build (Python 3.14):
- `pixi r test-nogil -- --collect-only`: 10 s.
- `pixi r test-nogil -- --collect-only --parallel-threads=2`: 30 s.
- `pixi r test-nogil` (1 thread): 167 s.
- `pixi r test-nogil -j2` (pytest-xdist): 123 s.
- `pixi r test-nogil -- --parallel-threads=2`: 270 s.
- `pixi r test-nogil -j2 -- --parallel-threads=2`: 192 s.
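For context: `--parallel-threads=N` comes, as far as I know, from the pytest-run-parallel plugin, and runs each test body in N threads at once to flush out thread-safety issues under free-threading. A minimal sketch of the idea (illustrative only, not the plugin's actual internals):

```python
import threading

def run_in_threads(test_fn, num_threads=2):
    """Illustrative sketch of what --parallel-threads does: run the same
    test body in N threads, released together so they actually contend.
    Not the plugin's real implementation."""
    barrier = threading.Barrier(num_threads)
    errors = []

    def worker():
        barrier.wait()  # release all threads at once to maximize contention
        try:
            test_fn()
        except Exception as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise errors[0]

# e.g. run_in_threads(test_some_numpy_function, num_threads=2)
```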
Now with submodule test selection:
- `pixi r test-nogil -- numpy/_core`: 57 s.
- `pixi r test-nogil -- numpy/f2py`: 80 s.
- `pixi r test-nogil -- --ignore=numpy/f2py`: 78 s.
Putting that all together:
- `pixi r test-nogil -- --parallel-threads=2 --ignore=numpy/f2py`: 186 s.
- `pixi r test-nogil -j2 -- --parallel-threads=2 --ignore=numpy/f2py`: 123 s.
Benchmarks show that the M3 Ultra is about 30% faster than the M1 single-core, and about 4x faster on a multi-core benchmark. For actually running multiple sets of tests in parallel in separate processes, performance should scale roughly linearly with the number of cores, which would probably give about a 5x speedup, given 8 vs. 28 cores (4 vs. 20 performance cores) times 30% per core. I'd expect the upcoming M5 Ultra to add at least another 30%, so let's say 6.5x faster.
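Spelling out that back-of-envelope arithmetic (all inputs are the rough estimates above, not new measurements):

```python
# Back-of-envelope speedup estimate for process-level sharding,
# M1 -> M3 Ultra-class machine. All inputs are rough estimates.
core_ratio = 28 / 8          # total cores; performance cores alone give 20/4 = 5
per_core_gain = 1.3          # M3 Ultra is ~30% faster per core than M1
print(core_ratio * per_core_gain)  # ~4.6, call it roughly 5x
m5_bump = 1.3                # assume an M5 Ultra adds another ~30%
print(5 * m5_bump)           # 6.5x overall
```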
Now assume TSan makes everything 50x slower (which seems like a worst case).
That should give a test suite runtime for the non-slow tests, with
f2py excluded, of roughly 186 s × 50 / 6.5 ≈ 1430 s (≈ 24 min).
That assumes that we can shard the tests easily to run in independent processes.
If that's not the case and we're stuck with something like `-j2`,
then it's going to be more like 100 minutes of runtime (123 s × 50 ≈ 102 min).
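A quick sanity check on those two estimates (one plausible way to reproduce the numbers):

```python
tsan_slowdown = 50                 # assumed worst-case TSan overhead
machine_speedup = 6.5              # from the estimate above

sharded = 186 * tsan_slowdown / machine_speedup  # 186 s sharded baseline
print(sharded / 60)                # ~23.8 min -> "roughly 24 min"

stuck_at_j2 = 123 * tsan_slowdown  # 123 s -j2 baseline, no extra core scaling
print(stuck_at_j2 / 60)            # ~102.5 min -> "more like 100 minutes"
```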
For SciPy, the test suite is heavier but definitely shardable across
submodules. The slowest submodule is `scipy.stats`, which takes about
as long as all of NumPy minus f2py:
- `pixi r test-nogil -- --parallel-threads=2 scipy/stats`: 171 s.
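Sharding across submodules in independent processes could look something like the sketch below; the shard list and the bare `pytest` invocation are illustrative, not an actual CI config:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard list; in practice this would cover every submodule.
SHARDS = ["scipy/stats", "scipy/sparse", "scipy/signal"]

def run_shard(path):
    # One pytest process per shard, so a TSan'd run scales with cores.
    return subprocess.run(
        ["pytest", "--parallel-threads=2", path],
        capture_output=True,
        text=True,
    )

with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
    results = list(pool.map(run_shard, SHARDS))

for path, result in zip(SHARDS, results):
    status = "OK" if result.returncode == 0 else f"FAILED ({result.returncode})"
    print(path, status)
```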