Manticore's 14x Speedup: How Code Rebuilds Drive Real-World Impact

Imagine waiting 14 times longer for your text to be turned into searchable vectors. That was the reality for some users of Manticore's Auto Embeddings. This is a look at the engineering choices that delivered a 14x speed boost for text-to-vector conversion.

DailyForage

Jul 3, 2026 · 4 min read · Technology · Manticore Search · ONNX

Key takeaways

1. Manticore's Auto Embeddings were built to transform text into vectors without needing external model services.
2. The original path, SentenceTransformers on top of Candle, left the CPU largely idle and stayed in the low double digits of documents per second.
3. The Manticore engineering team responded with a complete rebuild of the embeddings path using ONNX (Open Neural Network Exchange).
4. The rebuild delivered a 14x speed improvement for the Auto Embeddings feature.

Picture a complex search engine, tasked with understanding billions of documents. For a while, Manticore Search, a name known for its raw power, faced a frustrating bottleneck. Its 'Auto Embeddings' feature, designed to turn text into searchable vectors automatically, was lagging, often stuck in the low double digits of documents per second. This wasn't just a technical glitch; it was a drag on real-world efficiency, slowing down the very systems designed to make information accessible.

The Frustration of Underutilized Power

When Manticore first rolled out its Auto Embeddings, the ambition was clear: provide a seamless way to transform text into vectors without needing external model services. The initial implementation, however, relied on SentenceTransformers running atop Candle, Hugging Face's pure-Rust ML inference runtime. It quickly became apparent that this setup was leaving significant computational power on the table.

The CPU sat largely idle, and concurrent processing requests often serialized, meaning tasks ran one after another instead of in parallel. This underperformance wasn't just an inconvenience for developers; it translated directly to slower data ingestion, delayed search results, and higher operational costs for anyone trying to process large volumes of text. It's a classic example of a powerful tool hobbled by its underlying architecture.
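To see why serialization matters so much, here is a rough back-of-the-envelope model in Python. It is an idealized sketch, not a measurement of Manticore: the function name `wall_time`, the request count, and the per-request latency are hypothetical values chosen for illustration, and the model assumes every request takes the same time and that parallel lanes do not slow each other down.

```python
import math


def wall_time(requests: int, seconds_per_request: float, lanes: int) -> float:
    """Idealized wall-clock time when at most `lanes` requests run at once.

    Requests are processed in "waves" of up to `lanes` at a time, and each
    wave takes `seconds_per_request`. With lanes=1 everything is serialized.
    """
    if requests < 0 or seconds_per_request < 0 or lanes < 1:
        raise ValueError("requests and latency must be >= 0, lanes must be >= 1")
    waves = math.ceil(requests / lanes)
    return waves * seconds_per_request


# Hypothetical workload: 8 concurrent requests, 0.5 s each.
serialized = wall_time(8, 0.5, lanes=1)  # one after another
parallel = wall_time(8, 0.5, lanes=8)    # all at once on 8 idle cores

print(f"serialized: {serialized} s, parallel: {parallel} s")
```

In this toy model the serialized path takes eight times as long as the parallel one, even though the machine has the same cores available in both cases. Real workloads will not scale this cleanly, but the shape of the problem is the same.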

In the Manticore team's own words: "The previous path went through SentenceTransformers on top of Candle, Hugging Face's pure-Rust ML inference runtime, and it left a lot of CPU on the floor: most workloads sat in the low-double-digits of docs/sec no matter how we fed them."

A Surgical Rebuild: The ONNX Solution

Faced with this bottleneck, the Manticore engineering team made a decisive move: a complete rebuild of the embeddings path using ONNX (Open Neural Network Exchange). ONNX provides an open standard for representing machine learning models, allowing them to be run efficiently on various hardware and software platforms through its dedicated runtime.

This wasn't a superficial patch; it was a fundamental architectural shift. By integrating ONNX Runtime directly, Manticore bypassed the overheads of the previous setup, enabling better hardware utilization and efficient batch processing. The result? A stunning 14x speed improvement for text embeddings, transforming a sluggish process into one that truly lives up to the promise of real-time vectorization.

📌 Key Point: The shift to ONNX Runtime wasn't just swapping libraries; it was a fundamental architectural change, allowing Manticore to generate text embeddings 14 times faster through better hardware utilization and efficient batch processing.
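The batch processing mentioned above can be illustrated with a minimal Python sketch. This is not Manticore's implementation: `batched`, `embed_all`, and `fake_embed_batch` are hypothetical names, and the fake embedder merely stands in for a real model call (in a real system that callable might wrap an inference session from a runtime such as ONNX Runtime, which is not shown here). The point is only that grouping documents means one model call per batch instead of one per document.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
) -> List[Vector]:
    """Embed every text, calling the model once per batch."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        out = embed_batch(batch)
        if len(out) != len(batch):
            raise ValueError("embedder must return one vector per input text")
        vectors.extend(out)
    return vectors


def fake_embed_batch(batch: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in for a model: [text length, vowel count]."""
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in batch]


docs = ["alpha", "beta", "gamma", "delta", "epsilon"]
vectors = embed_all(docs, fake_embed_batch, batch_size=2)
print(len(vectors), vectors[0])
```

With five documents and a batch size of two, the embedder is called three times rather than five. The right batch size in practice depends on the model and the hardware, so treat the default of 32 as a placeholder.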

Beyond Benchmarks: Real-World Consequences

Some might dismiss this as a purely technical detail, a number on a benchmark. But in the world of data and information, a 14x speedup has profound real-world consequences. Imagine the difference it makes for a company processing customer feedback, news articles, or legal documents. What was once a day-long task could now be completed in hours, or even minutes.
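The "day-long task completed in hours" claim is simple arithmetic, sketched below in Python. The 14x factor comes from the article; everything else is an assumption for illustration, including the function names, the corpus of one million documents, and the baseline of 20 documents per second (picked only because it sits in the "low double digits" range).

```python
def time_after_speedup(baseline_hours: float, speedup: float = 14.0) -> float:
    """Hours needed after a given speedup factor, all else being equal."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours / speedup


def hours_to_embed(documents: int, docs_per_second: float) -> float:
    """Hours needed to embed `documents` at a steady throughput."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return documents / docs_per_second / 3600


# A 24-hour job at 14x: roughly 1.7 hours.
print(round(time_after_speedup(24.0), 2))

# Hypothetical corpus and baseline throughput (assumptions, not measurements).
corpus = 1_000_000
before = hours_to_embed(corpus, 20.0)       # about 13.9 hours
after = hours_to_embed(corpus, 20.0 * 14)   # about 1 hour
print(round(before, 1), round(after, 1))
```

Under these assumed numbers, an overnight embedding job shrinks to about an hour. Actual gains depend on the model, the documents, and the hardware.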

This isn't just about faster computers; it's about faster insights, quicker decision-making, and significantly reduced operational expenditure. When a core component like text embedding becomes this efficient, it frees up resources, both human and machine, to tackle more complex problems. It directly impacts the quality and responsiveness of applications that rely on understanding vast amounts of text.

Here's what a 14x speedup means on the ground:

Reduced operational costs for data processing and infrastructure.
Near real-time semantic search capabilities in applications.
Ability to scale embedding workloads without proportional hardware increase.
Improved user experience in applications relying on deep text understanding.

Key Facts

Manticore Search achieved a 14x speed improvement for its Auto Embeddings feature.
The original system struggled with low-double-digits of documents per second due to CPU underutilization.
The new, optimized path leverages ONNX Runtime for efficient model inference.
This enhancement allows automatic text-to-vector conversion without needing separate model services.

Conclusion

The story of Manticore's ONNX rebuild is a testament to the fact that even in highly specialized technical domains, fundamental engineering choices have tangible impacts on performance and, by extension, on real-world utility. It shows that sometimes, the most significant leaps forward come not from inventing entirely new technologies, but from meticulously optimizing existing ones. As data volumes continue to explode, how many other foundational systems are leaving similar performance on the table, waiting for a focused engineering effort to unlock their true potential?

FAQ

What is 'Auto Embeddings' in Manticore? Auto Embeddings is a Manticore feature that automatically converts text columns into vector representations, enabling semantic search without needing a separate model service.
Why was the old embeddings path slow? The previous path, using SentenceTransformers on Candle, suffered from CPU underutilization and serialization issues, preventing efficient parallel processing of documents.
What is ONNX Runtime? ONNX Runtime is a high-performance inference engine for ONNX models, allowing machine learning models to run efficiently across different hardware and operating systems.
What is the main benefit of this 14x speedup? The primary benefit is significantly faster text processing, leading to reduced operational costs, quicker insights, and more responsive applications that rely on understanding large volumes of text.
