Semantic Property Matching with ChromaDB

2025-07-14

Why Keyword Search Falls Short for Real Estate

When users say things like “close to downtown” or “lots of sunlight,” they’re expressing intent — not filters.

Traditional keyword search or SQL-based filtering often misses the mark:

  • It doesn’t understand synonyms or implied meaning
  • It can’t rank properties by vibe or fit
  • It relies too heavily on exact field matches

We wanted a recommendation engine that thinks in ideas, not just fields. That’s why we chose semantic search with ChromaDB.



Storing Listings as Vector Embeddings

We store all listings as vector embeddings, using OpenAI's text-embedding-3-small model.

The file load_listings.py does the heavy lifting:

import json
# `config`, `Document`, `generate_property_text`, and `chroma` come from the project's own modules

with open(config.LISTINGS_JSON_PATH, "r") as f:
    listings = json.load(f)

docs = []
for listing in listings:
    text = generate_property_text(listing)
    doc = Document(
        text=text,
        metadata={
            "id": listing["id"],
            "price": listing["price"],
            "zip": listing["zip"],
            "bedrooms": listing["bedrooms"],
            "bathrooms": listing["bathrooms"],
            "sqft": listing["sqft"],
        },
    )
    docs.append(doc)

chroma.add(documents=docs)

This code:

  • Loads the listings
  • Generates conversational summaries
  • Embeds them and inserts into ChromaDB
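If you use Chroma's raw client instead of a wrapper, `Collection.add` expects parallel lists of ids, documents, and metadatas rather than `Document` objects. A sketch of reshaping the same data (the `to_chroma_batch` helper is illustrative, not from the project code):

```python
def to_chroma_batch(listings, summarize):
    """Reshape listing dicts into the parallel lists that chromadb's
    Collection.add(ids=..., documents=..., metadatas=...) expects.
    `summarize` is any callable that turns a listing into text,
    e.g. generate_property_text."""
    ids = [str(listing["id"]) for listing in listings]
    documents = [summarize(listing) for listing in listings]
    metadatas = [
        {k: listing[k]
         for k in ("price", "zip", "bedrooms", "bathrooms", "sqft")
         if k in listing}
        for listing in listings
    ]
    return ids, documents, metadatas
```

You would then call `collection.add(ids=ids, documents=documents, metadatas=metadatas)`, letting the collection's embedding function embed each document string.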

Designing the recommend_properties Tool

When the voice agent needs to suggest homes, it calls recommend_properties().

Here’s what happens under the hood:

search_text = profile_to_text(user_profile)
search_vector = get_embedding(search_text)

results = chroma.query(
    query_embeddings=[search_vector],
    n_results=top_k,
    where=metadata_filters,
)

top_matches = parse_chroma_results(results)

In short, the tool:

  • Converts the profile into natural language
  • Embeds it with OpenAI
  • Queries ChromaDB with metadata filters
  • Parses and returns the top listings

This flow is voice-optimized and stateless.
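The `profile_to_text` step is plain string assembly. A minimal sketch, with field names assumed to match the UserProfile model described in the next section (treat them as illustrative):

```python
def profile_to_text(profile: dict) -> str:
    """Turn a structured user profile into a natural-language search
    query suitable for embedding. Field names are illustrative."""
    parts = [f"Looking for a {profile.get('bedrooms', 2)}-bedroom home"]
    if profile.get("neighborhood"):
        parts.append(f"in {profile['neighborhood']}")
    if profile.get("budget"):
        parts.append(f"around ${profile['budget']:,}")
    if profile.get("features"):
        parts.append("with " + ", ".join(profile["features"]))
    return " ".join(parts)
```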

Validating and Normalizing User Preferences

Before we generate embeddings, we make sure the user’s input is structured and clean. For example:

  • If the user says “budget is 450k”, we convert it to 450000
  • Phone numbers and dates are validated and normalized
  • Missing fields like square footage are filled with defaults (e.g., 2000 sqft)

This ensures our filters (e.g., budget range, bedrooms) work accurately during the ChromaDB query. We use a Pydantic model called UserProfile and helper functions to apply validation and defaults.
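The budget normalization mentioned above can be sketched as a standalone function, the kind of validator the Pydantic model would call. The helper name and the exact accepted formats are assumptions:

```python
def normalize_budget(raw) -> int:
    """Normalize spoken or typed budgets like '450k', '$450,000',
    or 450000 to an integer dollar amount."""
    if isinstance(raw, (int, float)):
        return int(raw)
    text = str(raw).lower().replace("$", "").replace(",", "").strip()
    if text.endswith("k"):
        return int(float(text[:-1]) * 1_000)
    if text.endswith("m"):
        return int(float(text[:-1]) * 1_000_000)
    return int(float(text))
```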


Crafting Natural Property Descriptions for Voice

We make listings sound human with a small summary generator:

def generate_property_text(listing: dict) -> str:
    text = f"A {listing['bedrooms']}-bedroom, {listing['bathrooms']}-bathroom home"
    if listing.get("neighborhood"):
        text += f" in {listing['neighborhood']}"
    if listing.get("price"):
        text += f", listed at ${listing['price']:,}"
    return text + "."

Example:

Raw JSON:

{
  "price": 420000,
  "bedrooms": 3,
  "bathrooms": 2,
  "neighborhood": "Logan Square",
  "description": "Charming, updated home near train and parks."
}

Generated summary:

“A 3-bedroom, 2-bathroom home in Logan Square, listed at $420,000.”

These summaries power both search and voice.

Ranking and Filtering the Results

We use a two-step filtering and ranking approach:

  1. Metadata Pre-filtering — We apply hard constraints like:

    • Budget range
    • Bedroom and bathroom count
    • Zip code (if specified)
  2. Semantic Similarity Ranking — After filtering, we embed the user query and compare it against all candidate properties using cosine similarity.

We return the top 3 matches (top_k = 3), sorted by how close their embeddings are to the user’s intent.
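In ChromaDB, the hard constraints from step 1 become the `where` clause passed to `chroma.query`, built from Chroma's `$lte`/`$gte`/`$and` operators. A sketch of constructing one (the helper name and field choices are assumptions):

```python
def build_where_filter(profile: dict) -> dict:
    """Translate hard constraints from the user profile into a
    ChromaDB `where` clause. Multiple conditions are combined
    with Chroma's $and operator."""
    conditions = []
    if profile.get("budget"):
        conditions.append({"price": {"$lte": profile["budget"]}})
    if profile.get("bedrooms"):
        conditions.append({"bedrooms": {"$gte": profile["bedrooms"]}})
    if profile.get("zip"):
        conditions.append({"zip": profile["zip"]})  # exact-match shorthand
    if not conditions:
        return {}
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}
```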

You can fine-tune this further by giving more weight to listings with:

  • Richer descriptions
  • More recent updates
  • Certain preferred features (e.g., garage, backyard)

Prompt Flow for Recommendations

The system prompt guides the agent to:

  • Offer one property at a time
  • Speak in plain language
  • Transition only after interest

Script flow:

Agent: I found a 2-bedroom with a big backyard near the train. Want to hear another or book a visit?
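The system prompt itself isn't shown in the post; a hedged sketch of how those three rules might be phrased as instructions:

```python
# Illustrative only -- the project's actual system prompt is not published.
SYSTEM_PROMPT = """You are a friendly real-estate voice assistant.
Rules:
- Offer exactly one property at a time; ask before sharing another.
- Use plain, conversational language suitable for a phone call.
- Only move on to booking a visit after the caller expresses interest.
"""
```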

Example Walkthrough: From Fuzzy Query to Spoken Match

Let’s say the user says:

“Looking for something cozy around 450 in Logan Square.”

The code:

search_text = "Looking for a cozy home in Logan Square, around $450,000"
vector = get_embedding(search_text)
results = chroma.query(query_embeddings=[vector], n_results=top_k)

Agent says:

“Here’s one: a sunlit 2-bed with a modern kitchen in Logan Square, listed at $445k.”


Why It Works

By combining:

  • Voice → structured profile
  • Profile → embeddings
  • Chroma → vector query
  • Results → prompt-shaped replies

We bridge AI search with natural voice UX.

How Everything Fits Together

Here’s the high-level flow of how user preferences become recommendations:

  1. Voice input — the agent collects preferences (location, budget, etc.)
  2. UserProfile → text summary
  3. OpenAI embedding
  4. ChromaDB vector query (with metadata filters)
  5. Top matches, sorted by similarity
  6. Agent formats and speaks the response

🧠 This flow bridges natural language intent with structured property listings — and returns conversational, human-friendly responses.


Lessons Learned and Future Improvements

Building this system taught us a few important things:

  • 🧭 Prompting matters. Early versions overwhelmed users with 3 listings at once — now we prompt the agent to offer just one and ask if they want more.
  • 🔍 Voice interaction reveals friction fast. What sounds great in a chat UI can feel clunky on a call. We had to rewrite summaries and simplify flows to sound natural.
  • ⚙️ Ranking is subjective. Semantic search helps a lot, but future versions could add user feedback loops (“👍 this listing?”) to improve results over time.

We’re excited to extend this into outbound lead calls, multi-property follow-ups, and even chatbot interfaces — all powered by the same semantic engine.


Watch It in Action

🎥 Watch Part 2: Semantic Search and RAG
💻 See the code on GitHub
📞 Want something like this? Schedule a call


Follow the Series

Read Part 2: Agent Architecture and Prompt Engineering