Similarity Graph ​

The Similarity Graph algorithm is the simplest baseline implementation of GraphRAG. It creates a graph where chunks are nodes and edges represent semantic similarity above a threshold.

Status: ✅ Available Now

Overview ​

Similarity Graph provides a straightforward approach to graph-enhanced RAG:

  1. Chunk documents into manageable pieces
  2. Embed chunks using your chosen embedding model
  3. Create edges between chunks with similarity above threshold
  4. Query: Vector search for seed chunks + BFS expansion

This creates a chunk similarity graph where related content is connected, enabling graph traversal to enrich retrieval context.

Installation ​

bash
pnpm add @graphrag-js/similarity

Quick Start ​

typescript
import { createGraph } from '@graphrag-js/core';
import { similarityGraph } from '@graphrag-js/similarity';
import { memoryGraph, memoryVector, memoryKV } from '@graphrag-js/memory';
import { openai } from '@ai-sdk/openai';

const graph = createGraph({
  model: openai('gpt-4o-mini'),
  embedding: openai.embedding('text-embedding-3-small'),
  provider: similarityGraph({
    similarityThreshold: 0.7,  // Only connect chunks with similarity > 0.7
  }),
  storage: {
    graph: memoryGraph,
    vector: memoryVector,
    kv: memoryKV,
  }
});

// Insert documents
await graph.insert([
  'GraphRAG enhances retrieval by using graph structures.',
  'Vector search finds similar documents based on embeddings.',
  'Graph traversal can discover related information.',
]);

// Query with graph expansion
const result = await graph.query('How does GraphRAG work?', {
  maxDepth: 2,  // BFS depth
  topK: 5,      // Number of seed chunks
});

console.log(result.text);

Configuration ​

similarityGraph(config) ​

typescript
interface SimilarityGraphConfig {
  similarityThreshold?: number;  // default: 0.7
}

Parameters:

  • similarityThreshold (default: 0.7)
    • Minimum cosine similarity to create an edge
    • Range: 0.0 to 1.0
    • Higher = sparser graph (fewer edges)
    • Lower = denser graph (more edges)

Recommended values:

  • 0.8-0.9: Very strict, only highly similar chunks
  • 0.7-0.8: Balanced (recommended)
  • 0.5-0.7: Looser connections
  • < 0.5: Too permissive, creates noise

Query Parameters ​

typescript
interface SimilarityQueryParams {
  maxDepth?: number;  // BFS expansion depth (default: 2)
  topK?: number;      // Number of seed nodes (default: 10)
}

maxDepth ​

Controls how far to traverse the graph from seed nodes:

  • 0: Pure vector search (no graph expansion)
  • 1: Immediate neighbors only
  • 2: Neighbors + neighbors-of-neighbors (recommended)
  • 3+: Deeper traversal (may include less relevant content)
typescript
// Vector search only (fastest)
await graph.query('question', { maxDepth: 0 });

// Graph expansion (better context)
await graph.query('question', { maxDepth: 2 });

topK ​

Number of seed chunks to retrieve from vector search:

  • Higher topK = more starting points
  • Lower topK = focused retrieval
typescript
// Focused retrieval
await graph.query('question', { topK: 5 });

// Broad retrieval
await graph.query('question', { topK: 20 });

How It Works ​

1. Document Insertion ​

Input: Documents
  ↓
Chunk into pieces
  ↓
Embed chunks (e.g., OpenAI text-embedding-3-small)
  ↓
Store vectors in vector index
  ↓
Create nodes in graph (one per chunk)
  ↓
Compute pairwise similarities
  ↓
Create edges where similarity > threshold
  ↓
Result: Chunk similarity graph
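
Conceptually, the last two steps are an O(N²) pairwise pass with a threshold gate. The sketch below is illustrative only; the Chunk and Edge shapes and the cosineSimilarity helper are assumptions, not the package's internals.

typescript
// Illustrative sketch of the edge-creation step, not the library's internal code.
// Assumes chunks have already been embedded; Chunk and Edge shapes are hypothetical.
interface Chunk {
  id: string;
  embedding: number[];
}

interface Edge {
  source: string;
  target: string;
  weight: number; // cosine similarity
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// O(N²) pairwise pass: keep only pairs whose similarity exceeds the threshold.
function buildSimilarityEdges(chunks: Chunk[], threshold = 0.7): Edge[] {
  const edges: Edge[] = [];
  for (let i = 0; i < chunks.length; i++) {
    for (let j = i + 1; j < chunks.length; j++) {
      const weight = cosineSimilarity(chunks[i].embedding, chunks[j].embedding);
      if (weight > threshold) {
        edges.push({ source: chunks[i].id, target: chunks[j].id, weight });
      }
    }
  }
  return edges;
}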

2. Query Processing ​

Input: Query
  ↓
Embed query
  ↓
Vector search → Top K seed chunks
  ↓
BFS expansion (maxDepth hops)
  ↓
Collect all visited chunks
  ↓
Send to LLM as context
  ↓
Result: Answer

3. BFS Expansion ​

Starting from seed chunks, the algorithm performs breadth-first search:

Depth 0: Seed chunks from vector search
  ↓
Depth 1: Neighbors of seeds (connected by similarity edges)
  ↓
Depth 2: Neighbors of neighbors
  ↓
...

Score decay: score(depth+1) = score(depth) * 0.9
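
A minimal sketch of this expansion with per-hop score decay (the Scored shape and the getNeighbors accessor are assumptions, not the package's internals):

typescript
// Illustrative BFS expansion with per-hop score decay; not the package's internal code.
// getNeighbors is a hypothetical accessor returning the ids adjacent to a chunk node.
interface Scored {
  id: string;
  score: number;
}

async function bfsExpand(
  seeds: Scored[],
  getNeighbors: (id: string) => Promise<string[]>,
  maxDepth = 2,
  decay = 0.9,
): Promise<Map<string, number>> {
  const scores = new Map<string, number>();
  for (const seed of seeds) {
    scores.set(seed.id, seed.score);
  }

  let frontier = seeds;
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: Scored[] = [];
    for (const node of frontier) {
      for (const neighborId of await getNeighbors(node.id)) {
        if (!scores.has(neighborId)) {
          const score = node.score * decay; // score(depth+1) = score(depth) * 0.9
          scores.set(neighborId, score);
          next.push({ id: neighborId, score });
        }
      }
    }
    frontier = next;
  }

  return scores; // every visited chunk with its decayed relevance score
}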

Graph Structure ​

Nodes ​

Each chunk becomes a node with:

typescript
{
  nodeType: 'chunk',
  content: 'The actual chunk text...',
  // ... other metadata
}

Edges ​

Edges represent semantic similarity:

typescript
{
  edgeType: 'similar',
  weight: 0.85,  // Cosine similarity score
}

Edges are bidirectional, since cosine similarity is symmetric.

Usage Examples ​

Basic Usage ​

typescript
const graph = createGraph({
  model: openai('gpt-4o-mini'),
  embedding: openai.embedding('text-embedding-3-small'),
  provider: similarityGraph({ similarityThreshold: 0.7 }),
});

await graph.insert('Your documents...');
const result = await graph.query('Your question?');

Tuning Similarity Threshold ​

typescript
// Strict threshold: sparse graph, higher precision
const strictGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.85 }),
});

// Loose threshold: dense graph, higher recall
const looseGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.6 }),
});

Comparing with Pure Vector Search

typescript
// Pure vector search (fastest, baseline)
const vectorResult = await graph.query('question', {
  maxDepth: 0,
  topK: 10,
});

// Graph-enhanced search (better context)
const graphResult = await graph.query('question', {
  maxDepth: 2,
  topK: 10,
});

Batch Insertion ​

typescript
const documents = [
  'Document 1 content...',
  'Document 2 content...',
  'Document 3 content...',
];

await graph.insert(documents);

With Custom Storage ​

typescript
import { neo4jGraph } from '@graphrag-js/neo4j';
import { qdrantVector } from '@graphrag-js/qdrant';
import { redisKV } from '@graphrag-js/redis';

const graph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.75 }),
  storage: {
    graph: neo4jGraph({ url: 'bolt://localhost:7687', ... }),
    vector: qdrantVector({ url: 'http://localhost:6333' }),
    kv: redisKV({ host: 'localhost', port: 6379 }),
  },
});

Performance Characteristics ​

Time Complexity ​

| Operation | Complexity | Notes |
| --- | --- | --- |
| Insert N chunks | O(N²) | Pairwise similarity computation |
| Query | O(topK × depth × branching) | BFS traversal |
| Vector search | O(log N) | Approximate (HNSW) |

Space Complexity ​

  • Nodes: O(N) where N = number of chunks
  • Edges: O(N²) worst case, O(N) average with threshold
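
As a rough sense of scale (a back-of-envelope sketch, not a benchmark):

typescript
// Worst-case undirected edge count for N chunks is N * (N - 1) / 2.
const n = 1000;
const worstCaseEdges = (n * (n - 1)) / 2; // 499,500 candidate pairs scored at insert time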

Benchmarks ​

On 1000 chunks (1536-dim embeddings, M1 Mac):

| Operation | Time |
| --- | --- |
| Insert 1000 chunks | ~5-10s |
| Vector search (k=10) | ~5-20ms |
| BFS expansion (depth=2) | ~50-100ms |
| End-to-end query | ~1-2s (including LLM) |

When to Use ​

✅ Good For

  • Quick prototyping - Simple to set up and understand
  • Baseline comparisons - Measure improvement of more complex algorithms
  • Small to medium datasets - < 100K chunks
  • Simple retrieval - When entities/relationships aren't critical
  • Fast iteration - No expensive extraction phase

āŒ Not Ideal For ​

  • Entity-focused queries - Use LightRAG or Fast GraphRAG instead
  • Relationship queries - Doesn't model explicit relationships
  • Thematic analysis - Use Microsoft GraphRAG with communities
  • Multi-hop reasoning - Use AWS GraphRAG with fact chains
  • Very large datasets - O(N²) edge creation becomes slow

Comparison with Other Algorithms ​

| Feature | Similarity | LightRAG | Microsoft | Fast |
| --- | --- | --- | --- | --- |
| Setup complexity | ⭐ Lowest | ⭐⭐ Low | ⭐⭐⭐⭐ High | ⭐⭐⭐ Medium |
| Indexing cost | $ Cheapest | $$$ Medium | $$$$$ Highest | $$ Low |
| Query quality | ⭐⭐ Basic | ⭐⭐⭐⭐ Very Good | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐ Good |
| Entity awareness | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Relationship modeling | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Community detection | ❌ No | ❌ No | ✅ Yes | ❌ No |

Advanced Topics ​

Custom Similarity Metrics ​

The implementation uses cosine similarity by default. To use other metrics, you'd need to:

  1. Modify the vector store to use different distance metrics
  2. Adjust the threshold accordingly
typescript
// Using Euclidean distance (via custom storage)
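// Note: customVectorStore is illustrative; swap in a vector store adapter that supports your chosen metric.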
const graph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.5 }),
  storage: {
    vector: customVectorStore({ metric: 'euclidean' }),
  },
});

Graph Analysis ​

typescript
// Get the underlying graph structure
const graphStore = graph.storage.graph;

// Analyze node connectivity
const degree = await graphStore.nodeDegree('chunk-123');

// Get all edges for a node
const edges = await graphStore.getNodeEdges('chunk-123');

Exporting the Graph ​

typescript
// Export to GraphML
const graphml = await graph.export('graphml');

// Export to JSON
const json = await graph.export('json');
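
To keep a copy on disk, the export can be written out with Node's fs/promises; a minimal sketch, assuming the GraphML export is returned as a string:

typescript
import { writeFile } from 'node:fs/promises';

// Persist the GraphML export so it can be opened in external tools (e.g. Gephi, yEd).
const graphml = await graph.export('graphml');
await writeFile('similarity-graph.graphml', graphml);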

Troubleshooting ​

Too Many/Too Few Edges ​

Problem: Graph is too sparse or too dense

Solution:

typescript
// Too sparse → lower the threshold (denser graph)
const denserGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.6 }),
});

// Too dense → raise the threshold (sparser graph)
const sparserGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.8 }),
});

Slow Insertion ​

Problem: O(N²) edge creation is slow

Solution:

  • Use larger chunks (fewer chunks to compare means a smaller N)
  • Increase threshold (fewer edges)
  • Use batching for large documents
  • Consider Fast GraphRAG for large datasets
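
For the batching point, a simple fixed-size loop over the public graph.insert API is usually enough; a sketch, with an arbitrary batch size and the documents array from the earlier example:

typescript
// Insert a large corpus in fixed-size batches rather than one giant call.
// documents is your full string[] corpus; the batch size is arbitrary.
const batchSize = 100;
for (let i = 0; i < documents.length; i += batchSize) {
  await graph.insert(documents.slice(i, i + batchSize));
  console.log(`Inserted ${Math.min(i + batchSize, documents.length)} of ${documents.length} documents`);
}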

Poor Retrieval Quality ​

Problem: Results not relevant

Solution:

  1. Adjust topK (try 5-20)
  2. Adjust maxDepth (try 1-3)
  3. Check similarity threshold
  4. Verify embedding quality
  5. Consider using entity-aware algorithms
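
One practical way to narrow this down is a small parameter sweep over topK and maxDepth using the public query API; a rough sketch:

typescript
// Rough parameter sweep: compare answers across a few topK / maxDepth settings.
const question = 'How does GraphRAG work?';

for (const topK of [5, 10, 20]) {
  for (const maxDepth of [0, 1, 2]) {
    const result = await graph.query(question, { topK, maxDepth });
    console.log(`topK=${topK} maxDepth=${maxDepth}:`, result.text.slice(0, 120));
  }
}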

Best Practices ​

  1. Start with defaults - threshold: 0.7, maxDepth: 2, topK: 10
  2. Tune threshold based on your domain and embedding model
  3. Use sparse graphs for better performance (higher threshold)
  4. Benchmark against pure vector search (maxDepth=0) to measure improvement
  5. Monitor graph density - adjust threshold if too many edges
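
For monitoring density, average node degree over a sample of chunks is a cheap proxy; a sketch that reuses the nodeDegree accessor from the Graph Analysis section (the chunk ids are assumed to come from your own bookkeeping):

typescript
// Cheap density check: average degree over a sample of chunk ids.
// chunkIds is assumed to come from your own bookkeeping; the library's id scheme may differ.
const graphStore = graph.storage.graph;
const chunkIds = ['chunk-001', 'chunk-002', 'chunk-003'];

let totalDegree = 0;
for (const id of chunkIds) {
  totalDegree += await graphStore.nodeDegree(id);
}

const averageDegree = totalDegree / chunkIds.length;
console.log(`Average degree: ${averageDegree.toFixed(1)}`);
// A very high average (hundreds of edges per chunk) suggests raising similarityThreshold.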

Next Steps ​

Source Code ​

View the implementation:

Released under the Elastic License 2.0.