Trie-based routing with Mustermann

Not sure if this is the right place.

I saw some discussion on moving hanami-router to be trie-based. This is something I’ve been considering for Mustermann back when I first implemented it… 13 years ago. Wow, time flies.

I’ve implemented a proof of concept for Mustermann and wanted to see whether there is interest on Hanami’s side, or if the plan would instead be to move away from Mustermann (or leave things as they are).

This does further increase compilation time. I’m happy to look into this, though. It should at least be possible to eliminate the need to compile the individual patterns.

I’m also not sure where your pain points are and what realistic route sets are like. Compiling 10k routes with four levels of nesting (each nesting level being one static and one dynamic segment) takes 3.5s on my machine. However, matching against these only takes 4.9μs (vs 0.7s when matching in a linear fashion). The break-even point seems to be around 50 routes; anything below that performs worse with the trie-based approach. Could also implement both a linear and a trie-based route-set matcher and switch between them automatically based on either a hard-coded value or some match simulation (thinking RouteSet#optimize), though that might be a bit much.

Anyway, if this is interesting to you, then I’ll work on finishing the Mustermann-side, and would be happy to help with hanami-router.

Hi, @rkh. This sounds interesting. I can’t speak for the core team, but I was the one doing the exploration you mentioned. It would definitely be worthwhile to discuss further.

To be clear, Hanami-router moved to a trie-based approach when @jodosha rewrote it for Hanami 2.0. He used Mustermann for segment matching, but it seems like Mustermann is overkill for that.

Mustermann is an incredibly powerful tool, but for the most basic (and most common) use case, which is matching whole segments, I don’t think Mustermann is needed at all. That was the strategy I was exploring in the branch you referenced.

Based on our findings, I think we can remove Mustermann for the basic usage, and keep it for more complicated matching (and expanding), like globs and sub-segment matching.

I’m curious to know more about your current work. What does a trie-based Mustermann look like? Does that imply a Mustermann construct for an entire routing tree? As opposed to the previous Mustermann object-per-route approach? I’d love to talk more.

Correct. The version I’m playing with right now would create one route set and give you the matched pattern, params, and a custom value you can assign to it. I’ve got matching to be fairly performant, and I think I have a way to still match multiple patterns. I think I’ll keep working on it either way and either do a POC with hanami-router, Sinatra, or a custom router to see how it performs on r10k.

Current work in progress API (might not be relevant to Hanami):

set = Mustermann::Set.new(type: :rails)

# Value can be anything, it is optional
set.add("/books/:id", "books.show")
set.add("/authors", "authors.index")
set.add("/books/:book_id/authors", "authors.index")

# matching
match = set.match("/books/1/authors")
match.value # => "authors.index"
match.params # => { "book_id" => "1" }

# expansion
set.expand("authors.index", {}) # => "/authors"
set.expand("authors.index", { books_id: 1 } # => "/books/1/authors"

Not sure what the impact would be compared to the current hanami-router implementation (happy to try it out and leave it be if it isn’t significant).

The trie already accounts for escaped and unescaped characters, which further avoids calls to URI when requests come in.

Where I’m at right now with optimizations is a 2,500x speedup (not missing a percentage sign) on matching of r10k-style routes (10k routes, 4 levels of nesting) compared to looping through an array of patterns, which is what Sinatra does.

My biggest concern is if this will work for complex scenarios, ie where a regular expression would perform backtracking. But maybe not supporting these is a fair-enough trade-off? I also feel like this is more an issue for Sinatra users, where /foo/*/bar/* might actually be in use. Hm… maybe I could just compile everything into one big regexp from the first splat on.

This is very cool! :green_heart: You are essentially creating a routing engine. This could be a powerful building block for any ruby web framework! It definitely lowers the barrier for entry with powerful, performant route matching and expanding in a single component. :nerd_face:

I just saw your comment on GitHub. I’d just as soon continue the conversation here, unless anyone thinks there’s more value to the community in that forum.

To be clear, I don’t think Mustermann is “too complex.” I think it’s awesome and super-powerful. But that power comes at a cost that is paid at compilation time (and possibly in memory use, although we did not identify that as a problem).

The Hanami router implements a trie as a deeply-nested Hash, with each complete segment acting as a String key in the Hash. The original approach used a Mustermann matcher for each dynamic segment. This turned out not to be very costly at compile time because the matching segments tended to use a small set of labels (e.g., :id is super-common), so the same matchers were frequently returned due to Mustermann’s internal memoization.

The problem with this approach was the inability to use different labels for dynamic segments appearing at similar locations in the route. The current implementation groups dynamic segments at the same route segment under one node in the trie, and then uses a Mustermann matcher for each complete route. This did prove costly at startup time, since we are now creating a matcher for every route. This is what led to my exploration with removing Mustermann.

Frankly, if you are only dealing with segment-level matching, I don’t see how any implementation (in pure ruby) will ever beat this Hash approach. Although I would love to learn if I am wrong. :nerd_face: However, I recognize that every additional feature request (globbing, sub-segment matching, multiple dynamic values per segment, etc.) brings us closer to re-implementing Mustermann.

I personally favor the approach of embracing additional power as needed. I would like to keep the core functionality with a Hash and simple String segments. I don’t know of anything faster than this. I would then embrace more complex approaches for more complex use-cases, as the core team determines that those use-cases should be supported. That is, I would use the current approach for the base case, without Mustermann for matching or expanding, and use different objects (including possibly Mustermann) as needed for more complex routes. Globbing is a good example: we could still use Mustermann for matching and expanding these routes, while retaining maximum performance for the more common base case.

What are your thoughts on this strategy?

1 Like

This makes a lot of sense. I do agree that if you wanted to implement most edge cases, you’d essentially end up rebuilding Mustermann or Journey. I also get that 80% of the use cases are extremely simple, with a single match.

I also agree on discussing this here rather than in the hanami-router PR. The only other place that would make sense is the Mustermann repository, but as long as none of the other maintainers chime in, I think this is fine.

I have found some improvements for the general pattern compile time on the way.

I think it would also be possible to compile patterns lazily, which could mean that full compilation isn’t even needed if all you use Mustermann for is expansion. However, there are some things I dislike about lazy compilation, including that I would expect an exception at pattern creation time if there’s a parse error. Not having that would mean a breaking change in Mustermann, which could have trickle-down effects on Sinatra, which is probably more in sustainability mode atm, so changing behavior that impacts it is something I’d want to avoid.

The trie I’m building atm is character-based, not segment-based. Again, I think this works better with handling escaping (as you don’t know upfront which characters might be escaped and which aren’t). This probably also isn’t relevant for most routes you’ll have in a RESTful CRUD application.

I’m happy to poke at it just for the fun of it. I do think Hanami is an interesting use case, as there’s more active development and still a lot to be built compared to the two other popular routers using Mustermann (Grape and Sinatra). And the more static routing compared to Sinatra means there are a lot fewer edge-cases (you can technically change env["PATH_INFO"] in a Sinatra route and then pass to the next route - issues around that were what made me give up on using a trie or some other approaches in the past). Eventually trying to incorporate this into hanami-router, even if that’s on a fork that never gets used, seems to be an easier way for me to see what the impact of this with a little less isolation would be. Assuming I’m not underestimating how complex the hanami-router internals are.

Interesting. How would lazy compilation affect forking web servers? I guess they’d each do their own compilation after forking?

Does the need for escaping implicate only non-RESTful routes? What are the use cases for this? I’m curious.

I did not know Sinatra could modify the path_info while processing. Is this common? It seems like dark hackery.

I think it would be pretty easy for you to replace our trie with a Mustermann::Set in Hanami router. Following my earlier exploration, I would like to start over and implement the improvements we discovered step-by-step in the router. This refactoring may make substituting Mustermann even easier. It sounds like a cool experiment. I hope we can continue our conversation as you go. :nerd_face:

Well, technically, any character can be URI escaped. And for non-meaningful characters (like a slash) they should be treated the same. So /books could also be /%62%6F%6F%6B%73 instead. Or any combination thereof. To the extent that most browsers will render both as /books when part of the URI. Not sure how hanami-router handles this at the moment.

I doubt anyone does this, but using pass is not uncommon, and there have been issues in the past whenever we tried to optimize route conditions (which can be arbitrary code in Sinatra). To be fair, though, I have taken a major step back from Sinatra maintenance about a decade ago, and haven’t had a Sinatra app in production with any real-world usage since 2019.

Oh, that sounds promising!