Interpretability for Neural TSP Solvers

Investigating the latent space of a neural TSP solver using sparse autoencoders.

TSP Solution Demo

Neural TSP Solver

SAE Demo

Sparse Autoencoder

Feature Analysis Demo

Feature Analysis

Introduction

Sparse coding has garnered interest in the AI interpretability community for discovering features and circuits within the latent space of neural networks. While most of this interpretability work has focused on language models, the same techniques can be applied to other modalities. We are interested in applying them to OR-solver models, starting with the TSP. Many recent efforts have used deep learning either to learn heuristics for solving the TSP or to solve it directly.


By investigating the latent space of these models, we hope to:

  • Learn about the model itself, for example to aid debugging and to understand its training dynamics
  • Identify motifs that could be externally useful to understanding the TSP domain
  • Extract other learned attributes of TSP instances for transfer learning

TSP Solver

Architecture

SAE Feature Analysis Architecture

System architecture showing the flow from TSP instances through the policy network to SAE feature extraction and analysis.

We use (RL4CO)'s implementation of (Kool et al.)'s Transformer Pointer Network, which performs next-node prediction to autoregressively construct TSP solutions. We focus on TSP instances with a fixed size of 100 nodes.
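As a reference point, here is a minimal sketch of how such a policy can be instantiated and rolled out with RL4CO. Treat the exact class names and constructor arguments as assumptions that may differ across RL4CO versions; this is an outline, not our exact training script.

```python
# Sketch only: class names and arguments follow recent RL4CO releases and may differ by version.
import torch
from rl4co.envs import TSPEnv
from rl4co.models import AttentionModel

# TSP environment with a fixed instance size of 100 nodes
# (older RL4CO versions take num_loc=100 directly instead of generator_params).
env = TSPEnv(generator_params={"num_loc": 100})

# Kool et al.-style attention policy wrapped with a REINFORCE rollout baseline.
model = AttentionModel(env, baseline="rollout")

# Greedy decoding on a small batch of freshly sampled instances.
td = env.reset(batch_size=[4])
with torch.no_grad():
    out = model.policy(td, env, phase="test", decode_type="greedy")
print(out["reward"])  # negative tour lengths, one per instance
```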

Data

Our TSP policy is trained on instances drawn from a configurable distribution. By default, nodes are drawn uniformly over the unit square, but we also run experiments with more structured distributions, such as clustered Gaussians, concentric shapes (for example, points on a circle), latitude-longitude pairs from real road-network waypoints, and other custom generators. For consistency, the model is only evaluated on instances drawn from the same distribution as the training data.

Data Distribution Example 1

Uniform

Data Distribution Example 2

Clusters

Data Distribution Example 3

Ring

Example TSP instances drawn from different distributions
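To make the distributions above concrete, here is a small, self-contained NumPy sketch of generators in the same spirit (illustrative only; not our actual generator code).

```python
import numpy as np

def uniform_instance(n=100, rng=None):
    """Nodes drawn uniformly over the unit square."""
    rng = rng or np.random.default_rng()
    return rng.random((n, 2))

def clustered_instance(n=100, n_clusters=4, std=0.05, rng=None):
    """Nodes drawn from a mixture of Gaussians with random centers."""
    rng = rng or np.random.default_rng()
    centers = rng.random((n_clusters, 2))
    labels = rng.integers(n_clusters, size=n)
    pts = centers[labels] + rng.normal(scale=std, size=(n, 2))
    return np.clip(pts, 0.0, 1.0)

def ring_instance(n=100, radius=0.4, noise=0.01, rng=None):
    """Nodes scattered around a circle centered in the unit square."""
    rng = rng or np.random.default_rng()
    theta = rng.random(n) * 2 * np.pi
    r = radius + rng.normal(scale=noise, size=n)
    pts = 0.5 + np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return np.clip(pts, 0.0, 1.0)
```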

During training, we resample the instances every epoch. For most settings, we used 100,000 instances per epoch for 300 epochs, or 30M unique instances. Alongside the training instances, we fix a separate validation environment that regenerates the same instances at each validation step. We also solve those validation instances optimally with the Concorde solver, so that we can report our model's performance as suboptimality relative to the optimum throughout training.
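The suboptimality we track is simply the relative gap between the model's tour length and Concorde's optimal tour length. A sketch of that bookkeeping, assuming the tours themselves are already available as node permutations:

```python
import numpy as np

def tour_length(coords, tour):
    """Total length of a closed tour; `tour` is a permutation of node indices."""
    ordered = coords[tour]
    diffs = np.diff(np.vstack([ordered, ordered[:1]]), axis=0)  # wrap back to the start
    return float(np.linalg.norm(diffs, axis=1).sum())

def optimality_gap(model_len, concorde_len):
    """Suboptimality in percent relative to the Concorde-optimal tour."""
    return 100.0 * (model_len / concorde_len - 1.0)
```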

Training Method

Due to the nature of the TSP, generating optimal solutions at scale for supervised learning is computationally intractable. Instead, we train our solver with reinforcement learning, following the work of (Joshi), allowing the model to learn directly from interaction with the environment rather than relying on supervised labels. Training uses the REINFORCE algorithm with gradient clipping to ensure stable parameter updates. Specifically, we use the RL4CO library's PyTorch Lightning-based training framework, including its TSP environment implementation. The reward signal is the negative tour length, incentivizing the model to find shorter tours. Each training run begins with a warm-up phase of 1,000 greedy rollouts, which helps initialize the baseline and stabilize early policy learning. Afterwards, node selection during rollouts is sampled from a temperature-controlled softmax distribution. Once REINFORCE produces gradients, we update parameters with the Adam optimizer.
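In simplified form, the REINFORCE objective with a baseline reduces to the sketch below. RL4CO's actual implementation handles the rollout baseline, batching, and Lightning integration; the variable names here are illustrative.

```python
import torch

def reinforce_loss(reward, baseline_reward, log_likelihood):
    """
    reward:          negative tour length of the sampled solution, shape [batch]
    baseline_reward: negative tour length of the baseline (e.g. greedy) rollout, shape [batch]
    log_likelihood:  sum of log-probabilities of the chosen nodes, shape [batch]
    """
    advantage = reward - baseline_reward
    # Minimizing this loss increases the log-likelihood of above-baseline tours.
    return -(advantage.detach() * log_likelihood).mean()

# Typical update step, with gradient clipping for stability:
# loss = reinforce_loss(reward, baseline_reward, log_likelihood)
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# optimizer.step()
```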

Training Results Animation Demo

As the model trains, the solution to a fixed instance noticeably straightens out.

Sparse Autoencoder

Interpretability Goal

Understanding what the neural TSP solver has learned requires dissecting its latent space in a human-interpretable way. We adopt the language of *A Mathematical Framework for Transformer Circuits* (Elhage et al., 2021). In this view, "features" are important directions in the model's latent space that behave like variables: each has a value (its activation) on every forward pass. "Circuits," on the other hand, are relationships between these features that behave like functions. We focus on finding "features." If we can reliably name such variables ("this node is on the convex hull" or "that node is this node's nearest neighbor," for example), we can build a mechanistic narrative of *why* the policy selects one next node over another.

SAE Background

A common obstacle in interpreting neural networks is *polysemanticity,* a phenomenon where individual neurons encode multiple, seemingly unrelated concepts (Olsson et al., 2022). In language models, for example, this would look like a neuron that fires for both French negations and HTML tags, confounding interpretation. A popular tool for "disentangling" these neurons is the Sparse Autoencoder (SAE), a secondary ML model trained on the activations of the model being interpreted. SAEs learn an *over-complete* and *sparse* representation of those activations.

Figure demonstrating dictionary learning

Given the data in n = 2 dimensions (left), learn an overcomplete (d > n), sparse (|{active}| < n) basis to represent the data.

The SAE architecture has three components: an encoder, an activation function, and a decoder. Its training objective, as an autoencoder, is to reconstruct the input after passing it through the encoder's latent space. Because autoencoders are typically used for compression, their latent space usually has a lower dimension than the input. The key difference with Sparse Autoencoders is that the latent space is higher-dimensional, with an additional sparsity constraint:

SAE Architecture

Mathematically, the SAE forward pass consists of three steps. Given a node embedding \(x \in \mathbb{R}^d\), the encoder first projects to the latent space: \[z = xW_{\text{enc}}^{\top} + b_{\text{enc}}\] where \(W_{\text{enc}} \in \mathbb{R}^{n \times d}\) contains the learned feature directions and \(b_{\text{enc}} \in \mathbb{R}^n\) is a bias vector. Next, the sparsification step applies top-\(k\) activation: \[z_{\text{sparse}} = \text{TopK}(z) = \text{ReLU}(z - \tau)\] where \(\tau\) is the \(k\)-th largest value in \(z\), effectively zeroing all but the \(k\) largest activations and creating the sparse bottleneck. Finally, the decoder reconstructs the input: \[\hat{x} = z_{\text{sparse}} W_{\text{dec}} + b_{\text{dec}}\] where \(W_{\text{dec}} \in \mathbb{R}^{n \times d}\) maps the sparse latent representation back to the original space.
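The same three steps, written as a standalone PyTorch module that follows the formulas above (including the \(\text{ReLU}(z - \tau)\) form of the top-\(k\) step). This is a sketch, not our exact SAE implementation.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model, expansion_factor=4, k_ratio=0.1):
        super().__init__()
        n_features = int(d_model * expansion_factor)
        self.k = max(1, int(n_features * k_ratio))
        self.W_enc = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):                               # x: [..., d_model]
        z = x @ self.W_enc.T + self.b_enc               # encode: [..., n_features]
        tau = z.topk(self.k, dim=-1).values[..., -1:]   # k-th largest activation per input
        z_sparse = torch.relu(z - tau)                  # zero everything below the top-k threshold
        x_hat = z_sparse @ self.W_dec + self.b_dec      # decode back to d_model
        return x_hat, z_sparse
```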


Because of the sparsity constraint, the discovered features are less likely to be polysemantic and are more interpretable to humans. SAEs have been used to find color and texture features in vision models (cite), features tied to specific concepts in LLMs (https://transformer-circuits.pub/2024/scaling-monosemanticity/), and protein motifs in DNA foundation models (https://www.goodfire.ai/blog/interpreting-evo-2), and for anomaly detection in robots (https://arxiv.org/html/2504.11170v2), among other applications. To our knowledge, our work is the first to apply the same tool to neural OR solvers.

SAE Training Details

Recall that the model works by first embedding the graph, then autoregressively decoding the solution. To capture the richest version of the graph's embedding, we attach the SAE to the encoder output, at the residual stream. Our SAE follows OpenAI's top-k SAE recipe (Gao et al., 2024). At each forward pass we zero all but the top-activating 10% of SAE neurons for each node embedding, giving us an effective L0 constraint while still allowing gradient flow through the soft top-k mask. The key hyperparameters are:

  • expansion factor: how many times larger the SAE's embedding space is than the residual stream
  • k ratio: the k-sparsity of the SAE features (0.1 → 10% of features active)
  • l1 coefficient: the strength of the l1 sparsity penalty applied to the SAE activations

We collect the SAE's training data by running inference on the TSP model for 100,000 instances, saving the graph embedding for each one. Crucially, these instances are drawn from the same distribution as the TSP solver's training data. After hyperparameter tuning, the resulting dictionary reconstructs residual activations near-perfectly, with sparsity increasing as training progressed.
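The collection step can be sketched with a standard PyTorch forward hook. The attribute path `policy.encoder` and the call signature below are illustrative assumptions about where the encoder output (residual stream) lives; adjust them to the actual model.

```python
import torch

def collect_encoder_activations(policy, env, dataloader):
    """Run greedy inference and cache the encoder's per-node embeddings.

    Assumes `policy.encoder` is the module whose output is the residual-stream
    embedding we attach the SAE to (illustrative path; adjust to your model).
    """
    cached = []

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        cached.append(h.detach().cpu())            # [batch, num_nodes, embed_dim]

    handle = policy.encoder.register_forward_hook(hook)
    with torch.no_grad():
        for td in dataloader:                      # batches of TSP instances
            policy(td, env, phase="test", decode_type="greedy")
    handle.remove()

    return torch.cat(cached).flatten(0, 1)         # [total_nodes, embed_dim]
```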


Feature Analysis

Feature Activation Mechanics

After training the SAE to discover interpretable feature directions, we want to understand how each feature responds to different TSP instances. A "feature activation" is simply the activation value of a specific neuron (feature) in the SAE's latent space. For a given node embedding \(x\), we compute the SAE's sparse latent representation \(z_{\text{sparse}}\) using the encoder and top-\(k\) sparsification described above. The activation of feature \(i\) is then just \(z_{\text{sparse}}[i]\)—the \(i\)-th component of this sparse vector.

Since each TSP instance contains multiple nodes, feature \(i\) produces a collection of activations \(\{z_{\text{sparse}}[j,i]\}_{j=1}^{N}\) across all \(N\) nodes. To summarize how strongly a feature responds to an entire instance, we compute the mean activation:

\[\mu_i \;=\; \tfrac{1}{N}\sum_{j=1}^{N} z_{\text{sparse}}[j,i].\]

This mean activation helps us sort and identify the most relevant features for different types of TSP instances.
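In code, this reduces to a mean over the node axis of the sparse codes; a small sketch, with `z_sparse` standing in for a tensor of per-node activations:

```python
import torch

def mean_feature_activation(z_sparse, feature_idx):
    """z_sparse: [num_instances, num_nodes, n_features] -> mean activation of one feature per instance."""
    return z_sparse[..., feature_idx].mean(dim=-1)   # [num_instances]

# Example: the ten instances on which feature i fires most strongly on average.
# top_instances = mean_feature_activation(z_sparse, i).argsort(descending=True)[:10]
```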

Visualizing a Feature

To see what a feature is doing, we overlay ten fresh 100-node instances from the same distribution in a single \(x\)-\(y\) plot. Each point's color encodes the corresponding value of \(z_{\text{sparse}}[j,i]\) (dark = 0, bright = high activation), while marker shape distinguishes which of the ten instances the node belongs to. Because activations are node-wise, this composite heatmap lets us spot geometric or combinatorial regularities at a glance, such as gradients that track tour direction, clusters of high-activation nodes near dense regions, or symmetry patterns that correlate with particular edge layouts.
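A matplotlib sketch of such an overlay, with `coords` and `z_sparse` standing in for the stacked node positions and SAE codes of the ten instances (hypothetical names, NumPy arrays assumed):

```python
import matplotlib.pyplot as plt

MARKERS = ["o", "s", "^", "v", "D", "P", "X", "*", "<", ">"]

def plot_feature_overlay(coords, z_sparse, feature_idx):
    """coords: [num_instances, num_nodes, 2]; z_sparse: [num_instances, num_nodes, n_features]."""
    fig, ax = plt.subplots(figsize=(5, 5))
    vmax = float(z_sparse[..., feature_idx].max())
    for inst in range(coords.shape[0]):
        act = z_sparse[inst, :, feature_idx]
        # Color = activation strength, marker = which instance the node came from.
        ax.scatter(coords[inst, :, 0], coords[inst, :, 1],
                   c=act, cmap="viridis", vmin=0, vmax=vmax,
                   marker=MARKERS[inst % len(MARKERS)], s=25)
    ax.set_title(f"Feature {feature_idx} activations across instances")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig
```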

Discussion

For demonstration, we trained an SAE on a model trained on the uniform distribution. We found many recurring themes among the features, shown below:

Feature Type | Example 1 | Example 2 | Example 3
"I'm on the edge" | Edge Feature Example 1 | Edge Feature Example 2 | Edge Feature Example 3
"Focus on one spot" | Focus Feature Example 1 | Focus Feature Example 2 | Focus Feature Example 3
"Linear separator" | Linear Feature Example 1 | Linear Feature Example 2 | Linear Feature Example 3
Unclear / Mysterious | Other Feature Example 1 | Other Feature Example 2 | Other Feature Example 3

These feature categories suggest that the SAE successfully disentangles different aspects of spatial reasoning that are important for TSP solving, from boundary detection to spatial clustering and geometric separations.

Policy Config: lr: 1e-05, epochs: 1000, embed_dim: 256, layers: 5, dropout: 0.1, temp: 1.0

Average Activation: 1.4363

SAE Config: lr: 0.001, epochs: 30, exp_factor: 4.0, k_ratio: 0.1, l1_coef: 0.001, batch_size: 64
Feature Overlay

Top Activating Instances

Bottom Activating Instances