Similarity Graph ​

The Similarity Graph algorithm is the simplest baseline implementation of GraphRAG. It creates a graph where chunks are nodes and edges represent semantic similarity above a threshold.

Status: ✅ Available Now

Overview ​

Similarity Graph provides a straightforward approach to graph-enhanced RAG:

  1. Chunk documents into manageable pieces
  2. Embed chunks using your chosen embedding model
  3. Create edges between chunks with similarity above threshold
  4. Query: Vector search for seed chunks + BFS expansion

This creates a chunk similarity graph where related content is connected, enabling graph traversal to enrich retrieval context.

Installation ​

bash
pnpm add @graphrag-js/similarity

Quick Start ​

typescript
import { createGraph } from '@graphrag-js/core';
import { similarityGraph } from '@graphrag-js/similarity';
import { memoryGraph, memoryVector, memoryKV } from '@graphrag-js/memory';
import { openai } from '@ai-sdk/openai';

const graph = createGraph({
  model: openai('gpt-4o-mini'),
  embedding: openai.embedding('text-embedding-3-small'),
  provider: similarityGraph({
    similarityThreshold: 0.7,  // Only connect chunks with similarity > 0.7
  }),
  storage: {
    graph: memoryGraph,
    vector: memoryVector,
    kv: memoryKV,
  }
});

// Insert documents
await graph.insert([
  'GraphRAG enhances retrieval by using graph structures.',
  'Vector search finds similar documents based on embeddings.',
  'Graph traversal can discover related information.',
]);

// Query with graph expansion
const result = await graph.query('How does GraphRAG work?', {
  maxDepth: 2,  // BFS depth
  topK: 5,      // Number of seed chunks
});

console.log(result.text);

Configuration ​

similarityGraph(config) ​

typescript
interface SimilarityGraphConfig {
  similarityThreshold?: number;  // default: 0.7
}

Parameters:

  • similarityThreshold (default: 0.7)
    • Minimum cosine similarity to create an edge
    • Range: 0.0 to 1.0
    • Higher = sparser graph (fewer edges)
    • Lower = denser graph (more edges)

Recommended values:

  • 0.8-0.9: Very strict, only highly similar chunks
  • 0.7-0.8: Balanced (recommended)
  • 0.5-0.7: Looser connections
  • < 0.5: Too permissive, creates noise

Query Parameters ​

typescript
interface SimilarityQueryParams {
  maxDepth?: number;  // BFS expansion depth (default: 2)
  topK?: number;      // Number of seed nodes (default: 10)
}

maxDepth ​

Controls how far to traverse the graph from seed nodes:

  • 0: Pure vector search (no graph expansion)
  • 1: Immediate neighbors only
  • 2: Neighbors + neighbors-of-neighbors (recommended)
  • 3+: Deeper traversal (may include less relevant content)
typescript
// Vector search only (fastest)
await graph.query('question', { maxDepth: 0 });

// Graph expansion (better context)
await graph.query('question', { maxDepth: 2 });

topK ​

Number of seed chunks to retrieve from vector search:

  • Higher topK = more starting points
  • Lower topK = focused retrieval
typescript
// Focused retrieval
await graph.query('question', { topK: 5 });

// Broad retrieval
await graph.query('question', { topK: 20 });

How It Works ​

1. Document Insertion ​

Input: Documents
  ↓
Chunk into pieces
  ↓
Embed chunks (e.g., OpenAI text-embedding-3-small)
  ↓
Store vectors in vector index
  ↓
Create nodes in graph (one per chunk)
  ↓
Compute pairwise similarities
  ↓
Create edges where similarity > threshold
  ↓
Result: Chunk similarity graph
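
Conceptually, the last two steps are an O(N²) pairwise pass with a threshold gate. The sketch below is illustrative only; the Chunk and Edge shapes and the cosineSimilarity helper are assumptions, not the package's internals.

typescript
// Illustrative sketch of the edge-creation step, not the library's internal code.
// Assumes chunks have already been embedded; Chunk and Edge shapes are hypothetical.
interface Chunk {
  id: string;
  embedding: number[];
}

interface Edge {
  source: string;
  target: string;
  weight: number; // cosine similarity
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// O(N²) pairwise pass: keep only pairs whose similarity exceeds the threshold.
function buildSimilarityEdges(chunks: Chunk[], threshold = 0.7): Edge[] {
  const edges: Edge[] = [];
  for (let i = 0; i < chunks.length; i++) {
    for (let j = i + 1; j < chunks.length; j++) {
      const weight = cosineSimilarity(chunks[i].embedding, chunks[j].embedding);
      if (weight > threshold) {
        edges.push({ source: chunks[i].id, target: chunks[j].id, weight });
      }
    }
  }
  return edges;
}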

2. Query Processing ​

Input: Query
  ↓
Embed query
  ↓
Vector search → Top K seed chunks
  ↓
BFS expansion (maxDepth hops)
  ↓
Collect all visited chunks
  ↓
Send to LLM as context
  ↓
Result: Answer

3. BFS Expansion ​

Starting from seed chunks, the algorithm performs breadth-first search:

Depth 0: Seed chunks from vector search
  ↓
Depth 1: Neighbors of seeds (connected by similarity edges)
  ↓
Depth 2: Neighbors of neighbors
  ↓
...

Score decay: score(depth+1) = score(depth) * 0.9
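
A minimal sketch of this expansion with per-hop score decay (the Scored shape and the getNeighbors accessor are assumptions, not the package's internals):

typescript
// Illustrative BFS expansion with per-hop score decay; not the package's internal code.
// getNeighbors is a hypothetical accessor returning the ids adjacent to a chunk node.
interface Scored {
  id: string;
  score: number;
}

async function bfsExpand(
  seeds: Scored[],
  getNeighbors: (id: string) => Promise<string[]>,
  maxDepth = 2,
  decay = 0.9,
): Promise<Map<string, number>> {
  const scores = new Map<string, number>();
  for (const seed of seeds) {
    scores.set(seed.id, seed.score);
  }

  let frontier = seeds;
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: Scored[] = [];
    for (const node of frontier) {
      for (const neighborId of await getNeighbors(node.id)) {
        if (!scores.has(neighborId)) {
          const score = node.score * decay; // score(depth+1) = score(depth) * 0.9
          scores.set(neighborId, score);
          next.push({ id: neighborId, score });
        }
      }
    }
    frontier = next;
  }

  return scores; // every visited chunk with its decayed relevance score
}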

Graph Structure ​

Nodes ​

Each chunk becomes a node with:

typescript
{
  nodeType: 'chunk',
  content: 'The actual chunk text...',
  // ... other metadata
}

Edges ​

Edges represent semantic similarity:

typescript
{
  edgeType: 'similar',
  weight: 0.85,  // Cosine similarity score
}

Edges are bidirectional, since cosine similarity is symmetric.

Usage Examples ​

Basic Usage ​

typescript
const graph = createGraph({
  model: openai('gpt-4o-mini'),
  embedding: openai.embedding('text-embedding-3-small'),
  provider: similarityGraph({ similarityThreshold: 0.7 }),
});

await graph.insert('Your documents...');
const result = await graph.query('Your question?');

Tuning Similarity Threshold ​

typescript
// Strict threshold: sparse graph, higher precision
const strictGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.85 }),
});

// Loose threshold: dense graph, higher recall
const looseGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.6 }),
});

Comparing with Pure Vector Search

typescript
// Pure vector search (fastest, baseline)
const vectorResult = await graph.query('question', {
  maxDepth: 0,
  topK: 10,
});

// Graph-enhanced search (better context)
const graphResult = await graph.query('question', {
  maxDepth: 2,
  topK: 10,
});

Batch Insertion ​

typescript
const documents = [
  'Document 1 content...',
  'Document 2 content...',
  'Document 3 content...',
];

await graph.insert(documents);

With Custom Storage ​

typescript
import { neo4jGraph } from '@graphrag-js/neo4j';
import { qdrantVector } from '@graphrag-js/qdrant';
import { redisKV } from '@graphrag-js/redis';

const graph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.75 }),
  storage: {
    graph: neo4jGraph({ url: 'bolt://localhost:7687', ... }),
    vector: qdrantVector({ url: 'http://localhost:6333' }),
    kv: redisKV({ host: 'localhost', port: 6379 }),
  },
});

Performance Characteristics ​

Time Complexity ​

| Operation | Complexity | Notes |
| --- | --- | --- |
| Insert N chunks | O(N²) | Pairwise similarity computation |
| Query | O(topK × depth × branching) | BFS traversal |
| Vector search | O(log N) | Approximate (HNSW) |

Space Complexity ​

  • Nodes: O(N) where N = number of chunks
  • Edges: O(N²) worst case, O(N) average with threshold
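
As a rough sense of scale (a back-of-envelope sketch, not a benchmark):

typescript
// Worst-case undirected edge count for N chunks is N * (N - 1) / 2.
const n = 1000;
const worstCaseEdges = (n * (n - 1)) / 2; // 499,500 candidate pairs scored at insert time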

Benchmarks ​

On 1000 chunks (1536-dim embeddings, M1 Mac):

| Operation | Time |
| --- | --- |
| Insert 1000 chunks | ~5-10s |
| Vector search (k=10) | ~5-20ms |
| BFS expansion (depth=2) | ~50-100ms |
| End-to-end query | ~1-2s (including LLM) |

When to Use ​

✅ Good For

  • Quick prototyping - Simple to set up and understand
  • Baseline comparisons - Measure improvement of more complex algorithms
  • Small to medium datasets - < 100K chunks
  • Simple retrieval - When entities/relationships aren't critical
  • Fast iteration - No expensive extraction phase

āŒ Not Ideal For ​

  • Entity-focused queries - Use LightRAG or Fast GraphRAG instead
  • Relationship queries - Doesn't model explicit relationships
  • Thematic analysis - Use Microsoft GraphRAG with communities
  • Multi-hop reasoning - Use AWS GraphRAG with fact chains
  • Very large datasets - O(N²) edge creation becomes slow

Comparison with Other Algorithms ​

| Feature | Similarity | LightRAG | Microsoft | Fast |
| --- | --- | --- | --- | --- |
| Setup complexity | ⭐ Lowest | ⭐⭐ Low | ⭐⭐⭐⭐ High | ⭐⭐⭐ Medium |
| Indexing cost | $ Cheapest | $$$ Medium | $$$$$ Highest | $$ Low |
| Query quality | ⭐⭐ Basic | ⭐⭐⭐⭐ Very Good | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐ Good |
| Entity awareness | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Relationship modeling | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Community detection | ❌ No | ❌ No | ✅ Yes | ❌ No |

Advanced Topics ​

Custom Similarity Metrics ​

The implementation uses cosine similarity by default. To use other metrics, you'd need to:

  1. Modify the vector store to use different distance metrics
  2. Adjust the threshold accordingly
typescript
// Using Euclidean distance (via custom storage)
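// Note: customVectorStore is illustrative; swap in a vector store adapter that supports your chosen metric.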
const graph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.5 }),
  storage: {
    vector: customVectorStore({ metric: 'euclidean' }),
  },
});

Graph Analysis ​

typescript
// Get the underlying graph structure
const graphStore = graph.storage.graph;

// Analyze node connectivity
const degree = await graphStore.nodeDegree('chunk-123');

// Get all edges for a node
const edges = await graphStore.getNodeEdges('chunk-123');

Exporting the Graph ​

typescript
// Export to GraphML
const graphml = await graph.export('graphml');

// Export to JSON
const json = await graph.export('json');
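
To keep a copy on disk, the export can be written out with Node's fs/promises; a minimal sketch, assuming the GraphML export is returned as a string:

typescript
import { writeFile } from 'node:fs/promises';

// Persist the GraphML export so it can be opened in external tools (e.g. Gephi, yEd).
const graphml = await graph.export('graphml');
await writeFile('similarity-graph.graphml', graphml);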

Troubleshooting ​

Too Many/Too Few Edges ​

Problem: Graph is too sparse or too dense

Solution:

typescript
// Too sparse → lower the threshold (denser graph)
const denserGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.6 }),
});

// Too dense → raise the threshold (sparser graph)
const sparserGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.8 }),
});

Slow Insertion ​

Problem: O(N²) edge creation is slow

Solution:

  • Use larger chunks (fewer chunks to compare means a smaller N)
  • Increase threshold (fewer edges)
  • Use batching for large documents
  • Consider Fast GraphRAG for large datasets
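
For the batching point, a simple fixed-size loop over the public graph.insert API is usually enough; a sketch, with an arbitrary batch size and the documents array from the earlier example:

typescript
// Insert a large corpus in fixed-size batches rather than one giant call.
// documents is your full string[] corpus; the batch size is arbitrary.
const batchSize = 100;
for (let i = 0; i < documents.length; i += batchSize) {
  await graph.insert(documents.slice(i, i + batchSize));
  console.log(`Inserted ${Math.min(i + batchSize, documents.length)} of ${documents.length} documents`);
}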

Poor Retrieval Quality ​

Problem: Results not relevant

Solution:

  1. Adjust topK (try 5-20)
  2. Adjust maxDepth (try 1-3)
  3. Check similarity threshold
  4. Verify embedding quality
  5. Consider using entity-aware algorithms
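
One practical way to narrow this down is a small parameter sweep over topK and maxDepth using the public query API; a rough sketch:

typescript
// Rough parameter sweep: compare answers across a few topK / maxDepth settings.
const question = 'How does GraphRAG work?';

for (const topK of [5, 10, 20]) {
  for (const maxDepth of [0, 1, 2]) {
    const result = await graph.query(question, { topK, maxDepth });
    console.log(`topK=${topK} maxDepth=${maxDepth}:`, result.text.slice(0, 120));
  }
}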

Best Practices ​

  1. Start with defaults - threshold: 0.7, maxDepth: 2, topK: 10
  2. Tune threshold based on your domain and embedding model
  3. Use sparse graphs for better performance (higher threshold)
  4. Benchmark against pure vector search (maxDepth=0) to measure improvement
  5. Monitor graph density - adjust threshold if too many edges
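
For monitoring density, average node degree over a sample of chunks is a cheap proxy; a sketch that reuses the nodeDegree accessor from the Graph Analysis section (the chunk ids are assumed to come from your own bookkeeping):

typescript
// Cheap density check: average degree over a sample of chunk ids.
// chunkIds is assumed to come from your own bookkeeping; the library's id scheme may differ.
const graphStore = graph.storage.graph;
const chunkIds = ['chunk-001', 'chunk-002', 'chunk-003'];

let totalDegree = 0;
for (const id of chunkIds) {
  totalDegree += await graphStore.nodeDegree(id);
}

const averageDegree = totalDegree / chunkIds.length;
console.log(`Average degree: ${averageDegree.toFixed(1)}`);
// A very high average (hundreds of edges per chunk) suggests raising similarityThreshold.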

Next Steps ​

Source Code ​

View the implementation:

Released under the Elastic License 2.0.