Skip to content

First attempt at embarrasingly parallel execution to compute head grids.#140

Draft
dbrakenhoff wants to merge 14 commits into
mainfrom
parallel_numba
Draft

First attempt at embarrasingly parallel execution to compute head grids.#140
dbrakenhoff wants to merge 14 commits into
mainfrom
parallel_numba

Conversation

@dbrakenhoff

@dbrakenhoff dbrakenhoff commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Early result suggests ~10x speedup on my machine.

Todo

  • Implement to_numba_tuple() for all element families intended for Numba path.
    Add fast potinf and potential kernels for remaining element classes. Missing:
    • CircAreaSink
    • LeakyWall
    • LeakyWallString
    • WellStringBase
  • Support aquifer-region aware evaluation in Numba backend. Reason: model currently
    selects aquifer by location; backend now uses a single aqtuple and ignores aq_id.
  • Preserve Model feature parity in Numba headgrid path. Must include: layers
    selection behavior, steady contribution, and output shape conventions.
  • Add Numba-optimized discharge-vector path: disvec and velocity grids still route
    through Python loops. Write fast parallel implementations for disvec_inf, disvec
    functions.
  • Testing
    • Add correctness tests comparing serial vs Numba-parallel across element mixes.
    • Add regression tests for edge cases that often break Numba kernels. Examples:
      points at well radius, near singular line endpoints, zero/very small times, many
      intervals, mixed g/v/z elements.
  • Documentation
    • Add help for contributors describing how to add new elements to fast numba code.

- add parallel submodule
- define numba tuples containing relevant data for computations for Model, Aquifer and Elements (LineSink and Well) for now. Each class gets to_numba_tuple() method to collect data.
- add integer mappings for elements and boundary conditions for identifying computation path
- gather element data in structured arrays
- write fast versions of potinf and potential
- parallelize on x,y pts
@dbrakenhoff dbrakenhoff self-assigned this Jun 18, 2026
@dbrakenhoff dbrakenhoff added the enhancement New feature or request label Jun 18, 2026
- add parallel submodule
- define numba tuples containing relevant data for computations for Model, Aquifer and Elements (LineSink and Well) for now. Each class gets to_numba_tuple() method to collect data.
- add integer mappings for elements and boundary conditions for identifying computation path
- gather element data in structured arrays
- write fast versions of potinf and potential
- parallelize on x,y pts
@dbrakenhoff

dbrakenhoff commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Parallel example (based on example from Discussion #115)

parallel_example.zip

Note that first run of numba code triggers the compilation step, which means it runs in about ~15s on my machine. The next run takes about ~3s. Normal timflow (using parallel=True, referring to multithreading) is about ~30s.

EDIT: example won't run until #139 is merged into dev and subsequently this branch.

@eriktoller

Copy link
Copy Markdown

Cool work and awesome to see such speed-up already!

Some thoughts:

  • It could be a good idea to skip fastmath=True, at least when still developing the code
  • it would be interesting to look at the difference if you change results_arr=bessellsv2(...) to bessellsv2(restuls_arr,...) and reuse a work array rather than creating a new one per linesink per point (300 linesink for 100 by 100 grid will be a lot of time for memory allocation).
  • it would also be cool if you could time the pre-processing, computations and post-processing of the numba approach to see where the majority of the effort goes into.

I will follow the progress with great interest, great work Davíd 🎉

@dbrakenhoff

Copy link
Copy Markdown
Contributor Author

@eriktoller Thanks for the early thoughts!

  • It could be a good idea to skip fastmath=True, at least when still developing the code

Good point, I did test it for my little example and it gave the same results, but better to develop it without for now and compare at the end. It also didn't really generate any speedup in my current example.

  • it would be interesting to look at the difference if you change results_arr=bessellsv2(...) to bessellsv2(restuls_arr,...) and reuse a work array rather than creating a new one per linesink per point (300 linesink for 100 by 100 grid will be a lot of time for memory allocation).

Good suggestions, I left the existing numba code alone for now, but an (optional) work array output seems like a good idea.

  • it would also be cool if you could time the pre-processing, computations and post-processing of the numba approach to see where the majority of the effort goes into.

The pre-processing is currently only 0.05% of the total computation time in my current example. Maybe it will be a bit more if I include all the code up to the first call to the numba optimized potential computations, but for now it seems negligible. But it will become more as we start adding in support for more elements probably. So good to keep an eye on that.

So >99% of the time is taken by the actual potential computations, and this is how my example script scales with the number of threads on my laptop (averages of 3 runs). The base case (1 thread) runs in ~35s.

image

@eriktoller

Copy link
Copy Markdown

@dbrakenhoff

So >99% of the time is taken by the actual potential computations, and this is how my example script scales with the number of threads on my laptop (averages of 3 runs). The base case (1 thread) runs in ~35s.

That is really impressive and shows great potential! For larger timflow models this will be a substantial upgrade when it come to plotting.

Did @mbakker7 have a look at this too?

- avoid memory assignment in loops
@dbrakenhoff

Copy link
Copy Markdown
Contributor Author

My early attempt at parallel computations within regular timflow using multithreading was a bit misguided. I now modified it to use multiprocessing instead of multithreading (#139) and this is the result when compared to the fully optimized numba in this PR.

image

@mbakker7

Copy link
Copy Markdown
Contributor

@dbrakenhoff

So >99% of the time is taken by the actual potential computations, and this is how my example script scales with the number of threads on my laptop (averages of 3 runs). The base case (1 thread) runs in ~35s.

That is really impressive and shows great potential! For larger timflow models this will be a substantial upgrade when it come to plotting.

Did @mbakker7 have a look at this too?

Yes, @eriktoller , I am in the loop. Mostly simply talking to @dbrakenhoff rather than giving comments here. Thanks for all your suggestions. Keep them coming!

Base automatically changed from dev to main June 26, 2026 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants