Vicco Labs
Building a production conversational assistant · Part 1
Redis Stack as an LLM retrieval layer

Have you used Redis for more than simple caching?

Every cache miss became a slow API call; p99 climbed, and so did LLM cost. That's where I discovered Redis Stack as a deterministic retrieval and analytics layer for LLM applications.

4 MAR 2026·3 min read·Redis / RAG / LLM / Cache

Until recently, that's exactly what Redis was for me: GET/SET, TTL, expiration, and life moves on.

Then I started using it as a persistent cache for the embedding matrices of a semantic router (via RedisVectorStore), keyed by fingerprint + intent and stored as compressed NPZ. After that I added Redis Checkpointer for LangGraph to persist state across runs and threads. It worked well… but that was only the beginning of what I ended up doing with Redis.
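A minimal sketch of that kind of embedding cache, under stated assumptions: the key layout `router:emb:{fingerprint}:{intent}`, the `fingerprint_of` helper, and the `InMemoryStore` stand-in are hypothetical names chosen for illustration, and the code talks to anything exposing `get`/`set` (such as a redis-py client) rather than going through RedisVectorStore.

```python
import hashlib
import io

import numpy as np


def fingerprint_of(model_name: str, examples: list[str]) -> str:
    """Hypothetical fingerprint: hash of the embedding model name plus the example texts."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    for text in examples:
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
        digest.update(text.encode("utf-8"))
    return digest.hexdigest()[:16]


def cache_key(fingerprint: str, intent: str) -> str:
    """Assumed key layout: one entry per (fingerprint, intent) pair."""
    return f"router:emb:{fingerprint}:{intent}"


def pack_matrix(matrix: np.ndarray) -> bytes:
    """Serialize a matrix as a compressed NPZ archive held in memory."""
    buffer = io.BytesIO()
    np.savez_compressed(buffer, matrix=matrix)
    return buffer.getvalue()


def unpack_matrix(payload: bytes) -> np.ndarray:
    """Inverse of pack_matrix."""
    with np.load(io.BytesIO(payload)) as archive:
        return archive["matrix"]


class InMemoryStore:
    """Stand-in with the same get/set shape as a Redis client, so the sketch runs offline."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def load_or_build(store, key: str, build) -> np.ndarray:
    """Return the cached matrix, or compute it with build() and cache it."""
    payload = store.get(key)
    if payload is not None:
        return unpack_matrix(payload)
    matrix = build()
    store.set(key, pack_matrix(matrix))
    return matrix
```

Because the fingerprint changes whenever the model or the example texts change, stale matrices are simply never looked up again. With a real Redis client, the connection would need to return raw bytes (no response decoding) for the NPZ payload to round-trip.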

When a conversational assistant (with LLM) starts depending on heavy APIs and large stores (product catalogs, preference history, events…), the bill shows up fast:

cache miss → slow API call → p99 climbs → LLM cost climbs = bad user experience.

That's where I discovered Redis Stack as a deterministic retrieval and analytics layer for LLM applications. What surprised me most was seeing it deliver, for free (or close to it), things I used to think required a separate vector DB or Elasticsearch:

  • native fuzzy search (tolerating "teniz" → "tênis", "cdbi ipca" → "CDB IPCA");
  • vector (embedding) search for semantic similarity;
  • powerful structured filters (tags, numeric ranges, full-text);
  • real-time analytics (metrics, events, time series).

All running inside the same Redis, without standing up parallel infrastructure.
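To make the list above a bit more concrete, here is a rough sketch of what those query types look like in RediSearch query syntax. It only builds query strings, so it runs without a server. The field names (`category`, `price`, `embedding`), the `$vec` parameter name, and the helper functions are all hypothetical, and the vector clause assumes query dialect 2 or later.

```python
def fuzzy_terms(text: str, distance: int = 1) -> str:
    """Wrap each word in % marks: one pair per unit of Levenshtein distance (1 to 3)."""
    if not 1 <= distance <= 3:
        raise ValueError("fuzzy distance must be between 1 and 3")
    marks = "%" * distance
    return " ".join(f"{marks}{word}{marks}" for word in text.split())


def filter_clause(category=None, price_min=None, price_max=None) -> str:
    """Build a TAG filter and/or a NUMERIC range filter; empty string if none apply."""
    parts = []
    if category is not None:
        parts.append(f"@category:{{{category}}}")
    if price_min is not None or price_max is not None:
        low = "-inf" if price_min is None else price_min
        high = "+inf" if price_max is None else price_max
        parts.append(f"@price:[{low} {high}]")
    return " ".join(parts)


def text_query(text: str, distance: int = 1, **filters) -> str:
    """Fuzzy full-text terms combined (implicit AND) with structured filters."""
    parts = [fuzzy_terms(text, distance), filter_clause(**filters)]
    return " ".join(part for part in parts if part)


def knn_query(k: int, prefilter: str = "", vector_field: str = "embedding") -> str:
    """K-nearest-neighbour vector search, optionally restricted by a pre-filter."""
    knn = f"[KNN {k} @{vector_field} $vec AS score]"
    if not prefilter:
        return f"*=>{knn}"
    return f"({prefilter})=>{knn}"


print(text_query("teniz", category="shoes", price_min=100, price_max=300))
# %teniz% @category:{shoes} @price:[100 300]
print(knn_query(5, filter_clause(category="shoes", price_max=300)))
# (@category:{shoes} @price:[-inf 300])=>[KNN 5 @embedding $vec AS score]
```

Strings like these would be sent through FT.SEARCH against an index that declares the matching TEXT, TAG, NUMERIC, and VECTOR fields, with the query embedding passed as bytes in the `vec` parameter. Whether a given typo actually matches depends on the distance you allow and on how the indexed text was normalized.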

Where this can fit

  • e-commerce: similar products + price range + available stock;
  • customer support: finding relevant tickets or FAQs even with typos;
  • logistics: detecting delay spikes and correlating with routes/events in real time;
  • media/content: hybrid recommendation (semantic + tags + recency).

But there's no free lunch

The trade-offs show up fast:

  • RAM and indexes are expensive (especially when you load vectors + heavy indexing);
  • tuning and NoSQL modeling require discipline (fat, slim, or hybrid: each has its price);
  • ingestion and updates can become a bottleneck if volume is high and indexing isn't designed well.
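A quick back-of-the-envelope estimate shows why the RAM point bites. This sketch counts only the raw float32 vector payload; index structures, keys, and other fields come on top. The corpus size and the 1536-dimension figure are illustrative assumptions, not numbers from my setup.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw payload only: vectors x dimensions x bytes per value (4 for float32)."""
    return num_vectors * dim * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical corpus: one million documents with 1536-dimensional embeddings.
payload = raw_vector_bytes(1_000_000, 1536)
print(payload)                    # 6144000000
print(round(to_gib(payload), 2))  # 5.72
```

That is close to 6 GB of RAM before any index overhead, which is why vector dimensionality and what you choose to index deserve attention early.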

Even so, when the core pain is latency + LLM cost + over-reliance on the model, it's worth evaluating case by case. In my case, it completely changed the game.

Next week I'll talk about something that paired very well with Redis Stack: Fat vs. Slim vs. Hybrid data modeling, and how each model affects cost, performance, and flexibility (plus which one I ended up using to solve my problem).

Were you already familiar with Redis Stack (RedisJSON + RediSearch + vector search)? How have you used it, and what was the biggest trade-off you faced?