Implementing Foreign Key Tracking in Memory

Implementing Foreign Key Tracking in Memory

Introduction

One of the biggest challenges that data synchronization tools like Neosync face is dealing with foreign key relationships correctly. You need to ensure that when you sync data across databases, you maintain referential integrity - meaning that if table A references table B, you need to sync table B's data first. But it gets even more complicated because there could be circular dependencies in the table, across tables or both!

In this blog, I'm going to walk through a basic, sample implementatinhow we can implement an in-memory foreign key tracker that helps us handle these dependencies efficiently. We'll use Go for the implementation since we write in Go. This blog won't cover every edge case (there are too many to list), but the goal is for you to get an idea of how to approach this problem.

Why track foreign keys in memory?

Before we dive into the code, let's talk about why we'd want to track foreign keys in memory versus just querying the database for this information.

  1. Performance - Querying the database for foreign key relationships every time we need this info is expensive, especially when dealing with hundreds of tables
  2. Cross-database support - Different databases handle foreign key metadata differently. Having an in-memory representation gives us a consistent interface
  3. Dependency resolution - We can use this data structure to efficiently determine the order in which to sync tables

Building the data structure

The data structure is arguably the most important part of this. Nailing the data structure will make writing the actual algorithm much easier. Since we're going to be doing a lot of look-ups, we'll heavily rely on maps.

Let's start with defining our core types:

type ForeignKey struct {
    TableName  string
    ColumnName string
    RefTable   string
    RefColumn  string
}
 
type TableGraph struct {
    // Map of table name to its foreign key constraints
    tables map[string][]ForeignKey
    // Map to track circular dependencies
    visited map[string]bool
    // Mutex for concurrent access
    mu sync.RWMutex
}
 
func NewTableGraph() *TableGraph {
    return &TableGraph{
        tables:   make(map[string][]ForeignKey),
        visited:  make(map[string]bool),
    }
}

The ForeignKey struct represents a single foreign key relationship. The TableGraph struct is our main data structure that tracks all relationships between tables.

Adding relationships

Now let's implement methods to add foreign key relationships:

func (g *TableGraph) AddForeignKey(fk ForeignKey) error {
    g.mu.Lock()
    defer g.mu.Unlock()
 
    // Initialize slice if table doesn't exist
    if _, exists := g.tables[fk.TableName]; !exists {
        g.tables[fk.TableName] = make([]ForeignKey, 0)
    }
 
    // Add the foreign key
    g.tables[fk.TableName] = append(g.tables[fk.TableName], fk)
    return nil
}

Handling circular dependencies

One of the trickiest parts of foreign key tracking is handling circular dependencies. For example, if table A references table B and table B references table A, we need a way to detect and handle this.

Here's how we can implement cycle detection:

func (g *TableGraph) HasCycle(tableName string, visited map[string]bool, path map[string]bool) bool {
    visited[tableName] = true
    path[tableName] = true
 
    for _, fk := range g.tables[tableName] {
        next := fk.RefTable
        if !visited[next] {
            if g.HasCycle(next, visited, path) {
                return true
            }
        } else if path[next] {
            return true // Found a cycle
        }
    }
 
    path[tableName] = false
    return false
}

This uses a depth-first search approach to detect cycles in our graph of table relationships.

Getting sync order

Now for the most important part - determining the order in which to sync tables. We'll use a topological sort to get this:

func (g *TableGraph) GetSyncOrder() ([]string, error) {
    g.mu.RLock()
    defer g.mu.RUnlock()
 
    visited := make(map[string]bool)
    order := make([]string, 0)
 
    var visit func(string) error
    visit = func(table string) error {
        if visited[table] {
            return nil
        }
 
        visited[table] = true
 
        // Visit all dependencies first
        for _, fk := range g.tables[table] {
            if err := visit(fk.RefTable); err != nil {
                return err
            }
        }
 
        order = append(order, table)
        return nil
    }
 
    // Visit all tables
    for table := range g.tables {
        if err := visit(table); err != nil {
            return nil, err
        }
    }
 
    return order, nil
}

This returns tables in an order where dependencies are synced before the tables that depend on them.

Using it in practice

Here's how you might use this in a real application:

func main() {
    graph := NewTableGraph()
 
    // Add some relationships
    graph.AddForeignKey(ForeignKey{
        TableName:  "orders",
        ColumnName: "user_id",
        RefTable:   "users",
        RefColumn:  "id",
    })
 
    graph.AddForeignKey(ForeignKey{
        TableName:  "order_items",
        ColumnName: "order_id",
        RefTable:   "orders",
        RefColumn:  "id",
    })
 
    // Get sync order
    order, err := graph.GetSyncOrder()
    if err != nil {
        log.Fatal(err)
    }
 
    // order will be: [users, orders, order_items]
    for _, table := range order {
        fmt.Printf("Sync table: %s\n", table)
    }
}

Performance considerations

This memory footprint of this implementation is O(N) where N is the number of foreign key relationships in your database. This would probably work for most applications - even those with thousands of tables and relationships. There are definitely edge cases with this like virtual foreign keys (which we didn't account for) and others but this is a good enough start.

Wrapping up

Building an in-memory foreign key tracker might seem complex at first, but breaking it down into these components makes it manageable. The key is having a clear data structure and handling the edge cases like circular dependencies properly.

We use this pattern extensively in Neosync to ensure data integrity when syncing across databases. It's also useful for other database tools like schema migration systems or backup tools.

The complete code for this implementation can be found in our open source repo if you want to see how we use it in production.


Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows

Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows

Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows

December 13th, 2024

View Article