Manticore's 14x Speed Boost: Rebuilding ONNX for Faster Embeddings

Imagine generating text embeddings 14 times faster. Manticore Search achieved this by rebuilding its text embedding pipeline on ONNX, lifting throughput from 10-20 documents per second to hundreds.

DailyForage

Jul 3, 2026 · 4 min read · Technology · Manticore Search text embeddings


Key takeaways

1. Text embeddings are the unsung heroes behind smart search and AI applications.
2. The team at Manticore didn't just tweak things; they went back to the drawing board.
3. Moving from the low double digits to hundreds of documents per second changes everything.
4. Manticore's text embeddings are now 14 times faster than the previous implementation.

Imagine trying to find a needle in a haystack, but the haystack keeps getting bigger and the search tools are sluggish. That's a bit like the challenge Manticore Search users faced when their brilliant 'Auto Embeddings' feature – which turns any text into a searchable vector – sometimes felt less 'auto' and more 'wait-a-minute'. The feedback was clear: speed was an issue. Throughput sat in the low double digits of documents per second, a bottleneck that left much of the available CPU unused.

The Embedding Speed Bump: Why It Mattered

Text embeddings are the unsung heroes behind smart search and AI applications. They convert human language into numerical vectors, allowing computers to understand relationships between words and phrases, powering everything from semantic search to recommendation engines. When Manticore first rolled out its Auto Embeddings, it was a fantastic leap: no need for a separate, resource-heavy model service. It ran directly within Manticore, using SentenceTransformers on top of Candle, Hugging Face's pure-Rust ML inference runtime.
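To make "understanding relationships" concrete, here is a minimal sketch of how vectors are compared once text has been embedded. The four-dimensional vectors below are hand-written, hypothetical values chosen for illustration; they do not come from Manticore or from any real model, and real embedding models produce vectors with hundreds of dimensions. Cosine similarity is one common comparison measure, used here as an assumption rather than as a statement about Manticore's internals.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written vectors standing in for model output.
toy_embeddings = {
    "running shoes": [0.9, 0.1, 0.0, 0.2],
    "jogging sneakers": [0.8, 0.2, 0.1, 0.3],
    "tax return form": [0.0, 0.9, 0.8, 0.1],
}


def rank_by_similarity(query, embeddings):
    """Rank every other text by its similarity to the query text."""
    query_vec = embeddings[query]
    scores = [
        (text, cosine_similarity(query_vec, vec))
        for text, vec in embeddings.items()
        if text != query
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("running shoes", toy_embeddings)
for text, score in ranking:
    print(f"{text}: {score:.3f}")
```

With these toy values, "jogging sneakers" ranks well above "tax return form" even though it shares no words with the query, which is the property semantic search relies on.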

While innovative, this setup had its limitations. Users, including Dmitrii Kuzmenkov, the engineer behind the recent overhaul, found that even with ample CPU, the processing rate hovered around 10-20 documents per second. This meant that for anyone dealing with large datasets, the promise of automatic, intelligent search was hampered by the practical reality of waiting. It wasn't about a lack of power; it was about how that power was being utilized.

The biggest frustration was seeing so much CPU sitting idle. We knew Manticore could do better, and our users deserved that efficiency.

From Candle to ONNX: Manticore's Engineering Overhaul

The team at Manticore didn't just tweak things; they went back to the drawing board. The core problem was that the Candle-based path serialized concurrent calls onto a single thread, so extra CPU cores did little to raise throughput. It simply wasn't built for the kind of parallel processing modern search demands. The solution? A complete rebuild, shifting the core inference engine to ONNX (Open Neural Network Exchange).
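The effect of serializing concurrent calls can be sketched in a few lines. This is not Manticore's code: `time.sleep` stands in for a model forward pass, and a shared lock is one assumed way that calls might end up running one at a time. All names here are hypothetical.

```python
import threading
import time

SIMULATED_INFERENCE_SECONDS = 0.05  # stand-in for one model forward pass

_engine_lock = threading.Lock()


def embed_serialized(doc):
    # Every caller waits for the same lock, so calls run one at a time
    # no matter how many threads or cores are available.
    with _engine_lock:
        time.sleep(SIMULATED_INFERENCE_SECONDS)
    return len(doc)


def embed_parallel(doc):
    # No shared lock: concurrent callers make progress at the same time.
    time.sleep(SIMULATED_INFERENCE_SECONDS)
    return len(doc)


def time_concurrent_calls(embed_fn, docs):
    """Start one thread per document and return the total wall-clock time."""
    threads = [threading.Thread(target=embed_fn, args=(d,)) for d in docs]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


docs = [f"document {i}" for i in range(8)]
serialized_seconds = time_concurrent_calls(embed_serialized, docs)
parallel_seconds = time_concurrent_calls(embed_parallel, docs)

print(f"serialized: {serialized_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

In the serialized case the total time grows with the number of callers, while the unlocked version finishes in roughly the time of a single call. That gap is the kind of idle capacity the article describes.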

ONNX isn't just another acronym; it's an open standard that allows developers to move machine learning models between different frameworks. This flexibility, crucially, enables models to run with highly optimized inference engines. By adopting ONNX, Manticore could integrate a much more efficient execution environment, specifically designed to handle parallel workloads and leverage CPU resources far more effectively. This wasn't a small change; it required rewriting significant portions of the embedding pipeline to fully capitalize on ONNX's capabilities.

📌 Key Point: Manticore's Auto Embeddings now run 14 times faster without requiring any external model service, keeping everything self-contained and streamlined.

What 14x Faster Really Means for You

Moving from the low double digits to hundreds of documents per second changes everything. For developers and businesses using Manticore, this speedup translates directly into more responsive applications and the ability to handle significantly larger data volumes without compromising performance. It means real-time semantic search becomes a practical reality, not just a theoretical possibility.

This isn't just an incremental improvement; it's a fundamental shift in how quickly you can index and search with AI-powered understanding. Imagine updating your product catalog and having its new descriptions instantly available for semantic search, or processing user queries with unparalleled speed. The benefits ripple through the entire application stack.
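A quick back-of-the-envelope calculation shows the scale of the change. The sketch below assumes the 14x factor applies uniformly to the reported 10-20 documents per second, which the article does not state explicitly, and it uses a hypothetical corpus of one million documents.

```python
def hours_to_embed(num_docs, docs_per_second):
    """Wall-clock hours needed to embed num_docs at a steady rate."""
    return num_docs / docs_per_second / 3600


SPEEDUP = 14              # factor reported in the article
OLD_RATES = (10, 20)      # documents per second, as reported
CORPUS_SIZE = 1_000_000   # hypothetical corpus size, for illustration only

# Assumption: the speedup applies evenly across the old range.
new_rates = tuple(rate * SPEEDUP for rate in OLD_RATES)

for old, new in zip(OLD_RATES, new_rates):
    before = hours_to_embed(CORPUS_SIZE, old)
    after = hours_to_embed(CORPUS_SIZE, new)
    print(f"{old} docs/s: {before:.1f} h  ->  {new} docs/s: {after:.1f} h")
```

Under these assumptions the new rate lands at 140-280 documents per second, consistent with "hundreds", and a job that took roughly 14 to 28 hours drops to about one to two hours.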

Here's what this dramatic speedup delivers:

Rapid Data Ingestion: Index new text data with embeddings in a fraction of the time.
Real-Time Semantic Search: Provide highly relevant search results almost instantaneously.
Scalability: Process larger datasets and higher query volumes without performance bottlenecks.
Resource Efficiency: Make better use of existing hardware, reducing operational costs.

Key Facts

Manticore's text embeddings are now 14 times faster than the previous implementation.
The old system processed text at 10-20 documents per second.
The new ONNX path can handle hundreds of documents per second.
This optimization was achieved by replacing SentenceTransformers on Candle with an ONNX-based solution.

Conclusion

The journey from identifying a performance bottleneck to rebuilding a core feature for a 14x speedup is a testament to focused engineering. It reminds us that even the most innovative features can be refined for greater efficiency and user benefit. What other hidden performance gains are waiting to be uncovered in the complex world of data processing?
