# Similarity Graph
The Similarity Graph algorithm is the simplest baseline implementation of GraphRAG. It creates a graph where chunks are nodes and edges represent semantic similarity above a threshold.
Status: ✅ Available Now
## Overview
Similarity Graph provides a straightforward approach to graph-enhanced RAG:
- Chunk documents into manageable pieces
- Embed chunks using your chosen embedding model
- Create edges between chunks with similarity above threshold
- Query: Vector search for seed chunks + BFS expansion
This creates a chunk similarity graph where related content is connected, enabling graph traversal to enrich retrieval context.
## Installation

```bash
pnpm add @graphrag-js/similarity
```

## Quick Start
```typescript
import { createGraph } from '@graphrag-js/core';
import { similarityGraph } from '@graphrag-js/similarity';
import { memoryGraph, memoryVector, memoryKV } from '@graphrag-js/memory';
import { openai } from '@ai-sdk/openai';

const graph = createGraph({
  model: openai('gpt-4o-mini'),
  embedding: openai.embedding('text-embedding-3-small'),
  provider: similarityGraph({
    similarityThreshold: 0.7, // Only connect chunks with similarity > 0.7
  }),
  storage: {
    graph: memoryGraph,
    vector: memoryVector,
    kv: memoryKV,
  },
});

// Insert documents
await graph.insert([
  'GraphRAG enhances retrieval by using graph structures.',
  'Vector search finds similar documents based on embeddings.',
  'Graph traversal can discover related information.',
]);

// Query with graph expansion
const result = await graph.query('How does GraphRAG work?', {
  maxDepth: 2, // BFS depth
  topK: 5,     // Number of seed chunks
});

console.log(result.text);
```

## Configuration
### similarityGraph(config)
```typescript
interface SimilarityGraphConfig {
  similarityThreshold?: number; // default: 0.7
}
```

Parameters:

- `similarityThreshold` (default: `0.7`) - Minimum cosine similarity required to create an edge
  - Range: `0.0` to `1.0`
  - Higher = sparser graph (fewer edges)
  - Lower = denser graph (more edges)

Recommended values:

- `0.8`-`0.9`: Very strict, only highly similar chunks
- `0.7`-`0.8`: Balanced (recommended)
- `0.5`-`0.7`: Looser connections
- `< 0.5`: Too permissive, creates noise
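The threshold is compared against plain cosine similarity between chunk embeddings. As a sketch of that check (illustrative helpers, not the package's internals):

```typescript
// Cosine similarity between two embedding vectors - the score that
// similarityThreshold is compared against.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// An edge is created only when the score clears the threshold.
const shouldConnect = (a: number[], b: number[], threshold = 0.7): boolean =>
  cosineSimilarity(a, b) > threshold;
```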
## Query Parameters
```typescript
interface SimilarityQueryParams {
  maxDepth?: number; // BFS expansion depth (default: 2)
  topK?: number;     // Number of seed nodes (default: 10)
}
```

### maxDepth

Controls how far to traverse the graph from the seed nodes:

- `0`: Pure vector search (no graph expansion)
- `1`: Immediate neighbors only
- `2`: Neighbors + neighbors-of-neighbors (recommended)
- `3+`: Deeper traversal (may include less relevant content)
```typescript
// Vector search only (fastest)
await graph.query('question', { maxDepth: 0 });

// Graph expansion (better context)
await graph.query('question', { maxDepth: 2 });
```

### topK
Number of seed chunks to retrieve from vector search:
- Higher `topK` = more starting points
- Lower `topK` = focused retrieval
```typescript
// Focused retrieval
await graph.query('question', { topK: 5 });

// Broad retrieval
await graph.query('question', { topK: 20 });
```

## How It Works
### 1. Document Insertion
```
Input: Documents
    ↓
Chunk into pieces
    ↓
Embed chunks (e.g., OpenAI text-embedding-3-small)
    ↓
Store vectors in vector index
    ↓
Create nodes in graph (one per chunk)
    ↓
Compute pairwise similarities
    ↓
Create edges where similarity > threshold
    ↓
Result: Chunk similarity graph
```

### 2. Query Processing
```
Input: Query
    ↓
Embed query
    ↓
Vector search → Top K seed chunks
    ↓
BFS expansion (maxDepth hops)
    ↓
Collect all visited chunks
    ↓
Send to LLM as context
    ↓
Result: Answer
```

### 3. BFS Expansion
Starting from seed chunks, the algorithm performs breadth-first search:
```
Depth 0: Seed chunks from vector search
    ↓
Depth 1: Neighbors of seeds (connected by similarity edges)
    ↓
Depth 2: Neighbors of neighbors
    ↓
...
```

Score decay: `score(depth+1) = score(depth) * 0.9`
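The expansion and decay described above can be sketched as follows (an illustrative implementation of the traversal logic, not the shipped one):

```typescript
// BFS expansion from vector-search seeds over an adjacency list.
// Each hop away from a seed multiplies the inherited score by 0.9.
function bfsExpand(
  adjacency: Map<string, string[]>,
  seeds: { id: string; score: number }[],
  maxDepth: number,
): Map<string, number> {
  const scores = new Map<string, number>();
  for (const seed of seeds) scores.set(seed.id, seed.score);
  let frontier = [...seeds];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: { id: string; score: number }[] = [];
    for (const node of frontier) {
      for (const neighbor of adjacency.get(node.id) ?? []) {
        if (!scores.has(neighbor)) {
          const score = node.score * 0.9; // score decay per hop
          scores.set(neighbor, score);
          next.push({ id: neighbor, score });
        }
      }
    }
    frontier = next;
  }
  return scores; // all visited chunks, seed scores included
}
```

With `maxDepth: 0` this degenerates to pure vector search: only the seeds are returned.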
## Graph Structure
### Nodes
Each chunk becomes a node with:
```typescript
{
  nodeType: 'chunk',
  content: 'The actual chunk text...',
  // ... other metadata
}
```

### Edges
Edges represent semantic similarity:
```typescript
{
  edgeType: 'similar',
  weight: 0.85, // Cosine similarity score
}
```

Edges are bidirectional - similarity is symmetric.
## Usage Examples
### Basic Usage
```typescript
const graph = createGraph({
  model: openai('gpt-4o-mini'),
  embedding: openai.embedding('text-embedding-3-small'),
  provider: similarityGraph({ similarityThreshold: 0.7 }),
});

await graph.insert('Your documents...');
const result = await graph.query('Your question?');
```

### Tuning Similarity Threshold
```typescript
// Strict threshold: sparse graph, higher precision
const strictGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.85 }),
});

// Loose threshold: dense graph, higher recall
const looseGraph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.6 }),
});
```

### Vector Search vs Graph Search
```typescript
// Pure vector search (fastest, baseline)
const vectorResult = await graph.query('question', {
  maxDepth: 0,
  topK: 10,
});

// Graph-enhanced search (better context)
const graphResult = await graph.query('question', {
  maxDepth: 2,
  topK: 10,
});
```

### Batch Insertion
```typescript
const documents = [
  'Document 1 content...',
  'Document 2 content...',
  'Document 3 content...',
];

await graph.insert(documents);
```

### With Custom Storage
```typescript
import { neo4jGraph } from '@graphrag-js/neo4j';
import { qdrantVector } from '@graphrag-js/qdrant';
import { redisKV } from '@graphrag-js/redis';

const graph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.75 }),
  storage: {
    graph: neo4jGraph({ url: 'bolt://localhost:7687', ... }),
    vector: qdrantVector({ url: 'http://localhost:6333' }),
    kv: redisKV({ host: 'localhost', port: 6379 }),
  },
});
```

## Performance Characteristics
### Time Complexity
| Operation | Complexity | Notes |
|---|---|---|
| Insert N chunks | O(N²) | Pairwise similarity computation |
| Query | O(topK × depth × branching) | BFS traversal |
| Vector search | O(log N) | Approximate (HNSW) |
### Space Complexity
- Nodes: O(N) where N = number of chunks
- Edges: O(N²) worst case, O(N) average with threshold
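The O(N²) edge-creation step behind these bounds can be sketched in a few lines; `buildSimilarityEdges` and the `Edge` shape are illustrative names, not exports of `@graphrag-js/similarity`:

```typescript
type Edge = { source: number; target: number; weight: number };

// Build similarity edges over pre-computed chunk embeddings.
// The nested loop compares every pair once, which is why insertion
// cost grows quadratically with chunk count.
function buildSimilarityEdges(
  embeddings: number[][],
  threshold: number,
): Edge[] {
  const cos = (a: number[], b: number[]): number => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  const edges: Edge[] = [];
  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      const weight = cos(embeddings[i], embeddings[j]);
      if (weight > threshold) edges.push({ source: i, target: j, weight });
    }
  }
  // Each edge is stored once; traversal treats it as bidirectional.
  return edges;
}
```

Raising the threshold prunes edges at this step, which is why sparser graphs also mean less storage and faster BFS.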
### Benchmarks
On 1000 chunks (1536-dim embeddings, M1 Mac):
| Operation | Time |
|---|---|
| Insert 1000 chunks | ~5-10s |
| Vector search (k=10) | ~5-20ms |
| BFS expansion (depth=2) | ~50-100ms |
| End-to-end query | ~1-2s (including LLM) |
## When to Use
### ✅ Good For
- Quick prototyping - Simple to set up and understand
- Baseline comparisons - Measure improvement of more complex algorithms
- Small to medium datasets - < 100K chunks
- Simple retrieval - When entities/relationships aren't critical
- Fast iteration - No expensive extraction phase
### ❌ Not Ideal For
- Entity-focused queries - Use LightRAG or Fast GraphRAG instead
- Relationship queries - Doesn't model explicit relationships
- Thematic analysis - Use Microsoft GraphRAG with communities
- Multi-hop reasoning - Use AWS GraphRAG with fact chains
- Very large datasets - O(N²) edge creation becomes slow
## Comparison with Other Algorithms
| Feature | Similarity | LightRAG | Microsoft | Fast |
|---|---|---|---|---|
| Setup complexity | ⭐ Lowest | ⭐⭐ Low | ⭐⭐⭐⭐ High | ⭐⭐⭐ Medium |
| Indexing cost | $ Cheapest | $$$ Medium | $$$$$ Highest | $$ Low |
| Query quality | ⭐⭐ Basic | ⭐⭐⭐⭐ Very Good | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐ Good |
| Entity awareness | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Relationship modeling | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Community detection | ❌ No | ❌ No | ✅ Yes | ❌ No |
## Advanced Topics
### Custom Similarity Metrics
The implementation uses cosine similarity by default. To use other metrics, you'd need to:
- Modify the vector store to use different distance metrics
- Adjust the threshold accordingly
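When switching to Euclidean distance, the threshold can be translated rather than guessed: for L2-normalized embeddings, squared distance and cosine similarity are linked by `d² = 2 * (1 - cos)`. A small illustrative helper:

```typescript
// For unit-length embeddings, ||a - b||^2 = 2 * (1 - cos(a, b)).
// Converts a cosine threshold into the equivalent Euclidean distance
// threshold (edges connect when distance is BELOW the returned value).
function cosineToEuclideanThreshold(cosineThreshold: number): number {
  return Math.sqrt(2 * (1 - cosineThreshold));
}

// e.g. a cosine threshold of 0.7 corresponds to a distance of about 0.775
```

This only holds for normalized embeddings; for unnormalized vectors the two metrics are not interchangeable.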
```typescript
// Using Euclidean distance (via custom storage)
const graph = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.5 }),
  storage: {
    vector: customVectorStore({ metric: 'euclidean' }),
  },
});
```

### Graph Analysis
```typescript
// Get the underlying graph structure
const graphStore = graph.storage.graph;

// Analyze node connectivity
const degree = await graphStore.nodeDegree('chunk-123');

// Get all edges for a node
const edges = await graphStore.getNodeEdges('chunk-123');
```

### Exporting the Graph
```typescript
// Export to GraphML
const graphml = await graph.export('graphml');

// Export to JSON
const json = await graph.export('json');
```

## Troubleshooting
### Too Many/Too Few Edges
Problem: Graph is too sparse or too dense
Solution:
```typescript
// Too sparse → lower the threshold
const sparseFix = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.6 }),
});

// Too dense → raise the threshold
const denseFix = createGraph({
  provider: similarityGraph({ similarityThreshold: 0.8 }),
});
```

### Slow Insertion
Problem: O(N²) edge creation is slow
Solution:
- Use smaller chunks (reduce N)
- Increase threshold (fewer edges)
- Use batching for large documents
- Consider Fast GraphRAG for large datasets
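If it is the dense graph, rather than the similarity computation itself, that hurts performance, one common mitigation is to keep only each node's top-k strongest edges. This is a hypothetical sketch, not a built-in option of `similarityGraph`:

```typescript
type WeightedEdge = { source: number; target: number; weight: number };

// Keep an edge only if it ranks in the top k by weight for at least one
// of its endpoints. Similarity is still computed pairwise, but the
// stored graph stays around O(N * k) edges, keeping BFS expansion fast.
function capEdgesPerNode(edges: WeightedEdge[], k: number): WeightedEdge[] {
  const byNode = new Map<number, WeightedEdge[]>();
  for (const e of edges) {
    for (const node of [e.source, e.target]) {
      if (!byNode.has(node)) byNode.set(node, []);
      byNode.get(node)!.push(e);
    }
  }
  const kept = new Set<WeightedEdge>();
  for (const list of byNode.values()) {
    list.sort((a, b) => b.weight - a.weight); // strongest edges first
    for (const e of list.slice(0, k)) kept.add(e);
  }
  return [...kept];
}
```

Applied after edge creation, this bounds graph density regardless of how low the similarity threshold is set.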
### Poor Retrieval Quality
Problem: Results not relevant
Solution:
- Adjust `topK` (try 5-20)
- Adjust `maxDepth` (try 1-3)
- Check the similarity threshold
- Verify embedding quality
- Consider using entity-aware algorithms
## Best Practices
- Start with defaults - `threshold: 0.7`, `maxDepth: 2`, `topK: 10`
- Tune the threshold based on your domain and embedding model
- Use sparse graphs for better performance (higher threshold)
- Benchmark against pure vector search (`maxDepth: 0`) to measure improvement
- Monitor graph density - adjust the threshold if there are too many edges
## Next Steps
- Algorithm Overview - Compare all algorithms
- LightRAG - Next step up in complexity 🚧
- Storage Options - Choose your backend
- API Reference - Full API documentation
## Source Code
View the implementation: