Large Language Models (LLMs) have made impressive progress and can perform a variety of tasks, from generating human-like text to answering questions. However, understanding how these models work remains challenging, in particular due to a phenomenon called superposition, where many features are mixed into one neuron, making it very difficult to extract human-understandable representations from the original model structure. This is where methods like the Sparse Autoencoder come in to disentangle features for interpretability.
In this blog post, we will use a Sparse Autoencoder to find some feature circuits for a particularly interesting case, subject-verb agreement, and understand how the model components contribute to the task.
Key concepts
Feature circuits
In the context of neural networks, feature circuits describe how networks learn to combine input features to form complex patterns at higher levels. We use the metaphor of “circuits” to describe how features are processed along the layers of a neural network because such processes remind us of electronic circuits that process and combine signals.
These feature circuits form gradually through the connections between neurons and layers, where each neuron or layer is responsible for transforming input features, and their interactions lead to useful feature combinations that play together to make the final predictions.
Here is one example of feature circuits: in lots of vision neural networks, we can find “a circuit as a family of units detecting curves in different angular orientations. Curve detectors are primarily implemented from earlier, less sophisticated curve detectors and line detectors. These curve detectors are used in the next layer to create 3D geometry and complex shape detectors” [1].
In the following sections, we will work on one feature circuit in LLMs for a subject-verb agreement task.
Superposition and Sparse Autoencoder
In the context of machine learning, we sometimes observe superposition: the phenomenon in which one neuron in a model represents multiple overlapping features rather than a single, distinct one. For example, InceptionV1 contains one neuron that responds to cat faces, fronts of cars, and cat legs.
This is where the Sparse Autoencoder (SAE) comes in.
The SAE helps us disentangle the network’s activations into a set of sparse features. These sparse features are normally human-understandable, allowing us to get a better understanding of the model. By applying an SAE to the hidden-layer activations of an LLM, we can isolate the features that contribute to the model’s output.
You can find the details of how the SAE works in my former blog post.
Case study: Subject-Verb Agreement
Subject-Verb Agreement
Subject-verb agreement is a fundamental grammar rule in English: the subject and the verb in a sentence must agree in number, i.e., singular or plural. For example:
- “The cat runs.” (Singular subject, singular verb)
- “The cats run.” (Plural subject, plural verb)
This rule is simple for humans, yet understanding it is important for tasks like text generation, translation, and question answering. But how do we know if an LLM has actually learned this rule?
In this section, we will explore how the LLM forms a feature circuit for such a task.
Building the Feature Circuit
Let’s now walk through the process of creating the feature circuit. We will do it in four steps:
- We start by inputting sentences into the model. For this case study, we consider sentences like:
- “The cat runs.” (singular subject)
- “The cats run.” (plural subject)
- We run the model on these sentences to get hidden activations. These activations represent how the model processes the sentences at each layer.
- We pass the activations to an SAE to “decompress” the features.
- We construct a feature circuit as a computational graph:
- The input nodes represent the singular and plural sentences.
- The hidden nodes represent the model layers to process the input.
- The sparse nodes represent the features obtained from the SAE.
- The output node represents the final decision. In this case: runs or run.
Toy Model
We start by building a toy language model with the following code. It may make no real linguistic sense; it is simply a network with two simple layers.
For the subject-verb agreement, the model is supposed to:
- Take as input a sentence with either a singular or plural subject.
- The hidden layer transforms such information into an abstract representation.
- The model selects the correct verb form as output.
# ====== Define Base Model (Simulating Subject-Verb Agreement) ======
import torch
import torch.nn as nn

class SubjectVerbAgreementNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(2, 4)   # 2 inputs → 4 hidden activations
        self.output = nn.Linear(4, 2)   # 4 hidden activations → 2 outputs (runs/run)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.hidden(x))   # Compute hidden activations
        return self.output(x)           # Predict the verb form
It is unclear what happens inside the hidden layer, so we introduce the following Sparse Autoencoder:
# ====== Define Sparse Autoencoder (SAE) ======
class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)  # "Decompress" activations into sparse features
        self.decoder = nn.Linear(hidden_dim, input_dim)  # Reconstruct the original activations
        self.relu = nn.ReLU()

    def forward(self, x):
        encoded = self.relu(self.encoder(x))  # Sparse feature activations
        decoded = self.decoder(encoded)       # Reconstruction of the original activations
        return encoded, decoded
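To see how the two networks connect, here is a minimal sketch of the forward data flow with untrained instances. The two-dimensional input encoding (one slot for a singular subject, one for a plural subject) and the number of SAE features are assumptions of this sketch, not part of the original setup.

# ====== Sketch: data flow from base model to SAE (assumed input encoding) ======
import torch

model = SubjectVerbAgreementNN()
sae = SparseAutoencoder(input_dim=4, hidden_dim=4)  # feature count chosen arbitrarily for this sketch

# Assumed toy encoding: [1, 0] = singular subject, [0, 1] = plural subject
x = torch.tensor([[1.0, 0.0], [0.0, 1.0]])

with torch.no_grad():
    h = model.relu(model.hidden(x))  # the base model's hidden activations, shape (2, 4)
    encoded, decoded = sae(h)        # sparse features and their reconstruction of h

print(encoded)  # the sparse features we will read the circuit from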
We train the original model SubjectVerbAgreementNN and the SparseAutoencoder with sentences designed to represent different singular and plural verb forms, such as “The cat runs” and “The babies run”. However, just as before, the toy inputs may not carry actual meaning.
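The full training code is omitted here; below is a rough sketch of how such training could look, assuming the toy inputs are the two-dimensional singular/plural indicators from above and the labels index the correct verb form (0 = “runs”, 1 = “run”). The SAE is trained on the frozen hidden activations with a reconstruction loss plus an L1 sparsity penalty.

# ====== Sketch: training the base model, then the SAE (assumed toy data) ======
import torch
import torch.nn as nn

# Assumed toy dataset: [1, 0] → "runs" (label 0), [0, 1] → "run" (label 1)
X = torch.tensor([[1.0, 0.0], [0.0, 1.0]] * 50)
y = torch.tensor([0, 1] * 50)

# --- Train the base model on the agreement task ---
model = SubjectVerbAgreementNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
ce = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss = ce(model(X), y)
    loss.backward()
    opt.step()

# --- Train the SAE on the frozen hidden activations ---
with torch.no_grad():
    H = model.relu(model.hidden(X))  # hidden activations, shape (100, 4)

sae = SparseAutoencoder(input_dim=4, hidden_dim=4)
opt_sae = torch.optim.Adam(sae.parameters(), lr=1e-2)
for _ in range(200):
    opt_sae.zero_grad()
    encoded, decoded = sae(H)
    loss = ((decoded - H) ** 2).mean() + 1e-3 * encoded.abs().mean()  # reconstruction + L1 sparsity
    loss.backward()
    opt_sae.step()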
Now we visualise the feature circuit. As introduced before, a feature circuit is a unit of neurons that processes specific features. In our model, the circuit consists of:
- The hidden layer, which transforms language properties into an abstract representation.
- The SAE, with independent features that contribute directly to the subject-verb agreement task.

As you can see in the plot, we visualize the feature circuit as a graph:
- Hidden activations and the encoder’s outputs are all nodes of the graph.
- The output nodes represent the correct verb form.
- Edges in the graph are weighted by activation strength, showing which pathways are most important in the subject-verb agreement decision. For example, you can see that the path from H3 to F2 plays an important role.
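A plot like this can be produced with a standard graph library. Below is a minimal sketch using networkx, continuing from the sketches above (so model and sae are assumed to be the trained toy networks); the node names and the way edge weights are derived (encoder weight magnitudes for H→F edges, feature activations for F→output edges) are choices made for this sketch, not necessarily the exact recipe behind the plot.

# ====== Sketch: drawing the feature circuit graph (assumed weighting scheme) ======
import torch
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
hidden_nodes = [f"H{i}" for i in range(4)]
feature_nodes = [f"F{j}" for j in range(4)]

# H → F edges, weighted by the magnitude of the SAE encoder weights
enc_w = sae.encoder.weight.detach().abs()  # shape: (n_features, n_hidden)
for j, f in enumerate(feature_nodes):
    for i, h in enumerate(hidden_nodes):
        G.add_edge(h, f, weight=float(enc_w[j, i]))

# F → output edges, weighted by how strongly each feature fires for each input class
with torch.no_grad():
    acts, _ = sae(model.relu(model.hidden(torch.eye(2))))  # row 0: singular, row 1: plural
for j, f in enumerate(feature_nodes):
    G.add_edge(f, "runs", weight=float(acts[0, j]))
    G.add_edge(f, "run", weight=float(acts[1, j]))

pos = nx.spring_layout(G, seed=0)
widths = [1 + 3 * G[u][v]["weight"] for u, v in G.edges()]
nx.draw(G, pos, with_labels=True, node_color="lightblue", width=widths)
plt.show()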
GPT2-Small
For a real case, we run similar code on GPT2-small. We show the graph of a feature circuit representing the decision to choose the singular verb.
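The full code for this experiment is not reproduced here; below is a rough sketch of how the activations can be gathered with the TransformerLens library before training an SAE on them. The prompts, the choice of layer 6, and the hook point are assumptions of this sketch, not necessarily the exact setup behind the plot.

# ====== Sketch: gathering GPT2-small activations with TransformerLens ======
# pip install transformer_lens
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT2-small

prompts = ["The cat", "The cats"]
logits, cache = model.run_with_cache(prompts)

# Compare the model's preference for the singular vs. plural verb form at the last position
runs_id = model.to_single_token(" runs")
run_id = model.to_single_token(" run")
last_logits = logits[:, -1, :]
print(last_logits[:, runs_id] - last_logits[:, run_id])  # positive → the model prefers "runs"

# Residual-stream activations at a middle layer (layer 6 is an arbitrary choice here)
resid = cache["resid_post", 6]  # shape: (batch, seq, d_model)

# These activations would then be fed to an SAE trained on this layer (training not shown),
# e.g. encoded, decoded = sae(resid[:, -1, :]), and the strongest features become circuit nodes.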

Conclusion
Feature circuits help us understand how different parts of a complex LLM contribute to a final output. We showed that it is possible to use an SAE to build a feature circuit for a subject-verb agreement task.
However, we have to admit that this method still requires some human intervention, in the sense that we don’t always know whether a circuit will really form without a proper design.
References
[1] Olah, C., et al. “Zoom In: An Introduction to Circuits.” Distill, 2020. https://distill.pub/2020/circuits/zoom-in/