The claim is simple: Spiking Neural Networks are more efficient than conventional artificial neural networks because they only compute when spikes arrive, and most neurons are silent most of the time. The claim is true. The implication that SNNs are therefore ready to replace ANNs in general AI applications is not supported by the current research.
The benchmarks are more interesting than either the enthusiasts or the skeptics suggest.
The Theoretical Foundation
Wolfgang Maass established the computational basis for spiking networks in a 1997 paper in Neural Networks: "Networks of Spiking Neurons: The Third Generation of Neural Network Models." The core finding was that spiking neurons are computationally more powerful than sigmoidal neurons at the unit level. A single spiking neuron can compute functions that require hundreds of hidden units in a conventional network, because the timing of spikes carries information that rate-coded activations cannot express.
This theoretical advantage did not translate to practical superiority for two decades, because training spiking networks was intractable. The reason is the spike function itself. A neuron fires when its membrane potential crosses a threshold, producing a binary event. The mathematical description of this is a Heaviside step function, whose derivative is zero everywhere except at the threshold, where it is undefined (a Dirac delta, in the distributional sense). Backpropagation, the algorithm that makes training practical, requires computing gradients through the activation function. A Heaviside function has no useful gradient.
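In symbols (the notation here is mine, not drawn from the cited papers), the output spike is a thresholded membrane potential, $s = H(V - \vartheta)$, where

$$
H(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}
\qquad \Rightarrow \qquad
\frac{dH}{dx} = 0 \quad \text{for all } x \neq 0,
$$

so $\partial s / \partial V$ vanishes almost everywhere and the chain rule carries no learning signal through the spike.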
Pfeiffer and Pfeil surveyed the field in a 2018 Frontiers in Neuroscience review, "Deep Learning With Spiking Neurons: Opportunities and Challenges," and found that at that point only TrueNorth, SpiNNaker, and BrainScaleS had demonstrated deep SNNs running on dedicated silicon at all. The gap between biological motivation and engineering capability was large.
The breakthrough came from surrogate gradients. Zenke and Ganguli's 2018 paper in Neural Computation, "SuperSpike: Supervised Learning in Multilayer Spiking Neural Networks," introduced the fast sigmoid as a surrogate for the Heaviside derivative: during the forward pass, real spikes propagate; during the backward pass, the gradient of a smooth function centered at the threshold is used instead. This is a principled approximation rather than a rigorous gradient, but it works empirically. Neftci, Mostafa, and Zenke (2019) formalized the framework in IEEE Signal Processing Magazine, and it became the standard method for direct SNN training. SpikingJelly, a full-stack SNN framework described by Fang et al. in Science Advances (2023), implemented surrogate gradient training with an 11x acceleration over previous training infrastructure and brought direct SNN training within reach of researchers without hardware specialization.
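The mechanics are easy to see in code. Below is a minimal PyTorch sketch of the trick, my own illustration rather than the authors' implementation: the forward pass emits a hard Heaviside spike, while the backward pass substitutes the derivative of a fast sigmoid. The steepness constant is an illustrative choice, not a value from the paper.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike forward, fast-sigmoid surrogate gradient backward."""
    SCALE = 10.0  # surrogate steepness; illustrative value

    @staticmethod
    def forward(ctx, v):
        # v is membrane potential minus threshold; spike when it crosses zero
        ctx.save_for_backward(v)
        return (v >= 0).to(v.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Derivative of a scaled fast sigmoid v / (1 + |v|): a smooth bump
        # centered at the threshold replaces the useless Heaviside derivative
        surrogate = 1.0 / (1.0 + SurrogateSpike.SCALE * v.abs()) ** 2
        return grad_output * surrogate

spike = SurrogateSpike.apply  # drop-in nonlinearity for a spiking layer
```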
What the Accuracy Numbers Show
The honest accuracy comparison requires separating three categories of benchmark: static image tasks, event-based sensor tasks, and temporal audio tasks. SNNs perform very differently across these categories.
Static images: still a gap, closing fast
On MNIST, SNN accuracy has reached 98.73% at a single timestep. A conventional neural network reaches approximately 99.7%. The gap is small but real, and the SNN delivers its result at dramatically lower computational cost.
CIFAR-10 tells a more encouraging story. The STAA-SNN architecture reaches 97.14%, compared to ANN baselines around 98.6%. A gap of roughly 1.5 percentage points is meaningful progress: early SNN results on this benchmark trailed by 10 points or more.
CIFAR-100 shows the gap widening with task complexity: the STAA-SNN achieves 82.05% against ANN performance above 95%. The approximately 13-point gap reflects the difficulty of maintaining high accuracy at high timestep efficiency on complex classification tasks.
ImageNet is the benchmark that matters most for practical applications. In 2020, state-of-the-art SNN performance on ImageNet was around 67-70% top-1 accuracy using SEW-ResNet architectures. By 2025, MSVIT (a multi-scale spiking vision transformer reported at IJCAI 2025) reached 85.06% top-1 and 97.58% top-5. SGLFormer reached 83.73% and QKFormer 84.22%. SEW-ResNet152 reached 77.30%. The best ANN vision transformers exceed 90%.
The trajectory matters as much as the current numbers. SNNs improved approximately 15 percentage points on ImageNet between 2020 and 2025. The PMSM approach demonstrates 81.6% top-1 at a single timestep (T=1), which has direct implications for latency. The gap is closing at a rate that suggests competitive accuracy within a few years.
Event-based data: SNNs win clearly
The picture reverses on benchmarks derived from event cameras and neuromorphic sensors. On N-MNIST (the neuromorphic version of handwritten digits, recorded with a DVS event camera) and DVS-Gesture (hand gesture recognition from event camera data), SNNs achieve accuracy above 98%. On the Spiking Heidelberg Digits audio dataset, which encodes spoken digit recordings as spike trains across 700 channels, SNNs exceed 93%.
The reason is architectural compatibility. Event camera data is already a sequence of binary events. An SNN processes it natively without encoding overhead. An ANN must first convert the event stream into a frame-based representation, throwing away the temporal precision that makes event camera data useful. The SNN does not just match ANN performance on this data; it processes the data in the format it was captured in.
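A sketch of what native processing means, assuming a hypothetical event array of (timestamp, x, y, polarity) rows: the SNN consumes a binary spike tensor built directly from the events, while a frame-based ANN pipeline would collapse the time axis first.

```python
import numpy as np

def events_to_spike_tensor(events, height, width, num_steps):
    """Bin DVS events (rows of timestamp, x, y, polarity) into a binary
    [T, 2, H, W] spike tensor. An SNN consumes this directly; a frame-based
    ANN pipeline would sum over the T axis first, discarding spike timing."""
    tensor = np.zeros((num_steps, 2, height, width), dtype=np.float32)
    t = events[:, 0]
    step = ((t - t.min()) / (t.max() - t.min() + 1e-9)
            * (num_steps - 1)).astype(int)
    for s, (_, x, y, p) in zip(step, events):
        tensor[s, int(p), int(y), int(x)] = 1.0
    return tensor
```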
What the Energy Numbers Show
The energy advantage of SNNs is real but conditional. The condition is spike sparsity.
A CEA hardware-aware study provides the clearest quantification of this. When spike sparsity is below 0.1 spikes per synapse per timestep, SNNs are 3.6 times more energy-efficient than equivalent ANNs. When spike sparsity rises above 0.5 spikes per synapse, SNNs lose their energy advantage entirely and become less efficient than ANNs. The crossover point is around 0.3-0.4 spikes per synapse.
This is the critical number that the general coverage of neuromorphic computing usually omits. Energy efficiency is not an inherent property of spiking networks. It is a property of sparse spiking networks. A dense SNN that fires most of its neurons on every timestep does not save energy compared to an ANN. It may use more.
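The crossover logic is simple arithmetic. The sketch below uses commonly cited 45 nm compute-energy figures (Horowitz, 2014) for a multiply-accumulate versus a bare accumulate; it deliberately ignores memory traffic, which is why the CEA study's hardware-aware crossover lands far below this compute-only estimate.

```python
# Compute-only energy sketch. E_MAC and E_AC are commonly cited 45 nm
# figures (Horowitz 2014): ~4.6 pJ per 32-bit multiply-accumulate,
# ~0.9 pJ per accumulate. Memory access, which dominates on real hardware,
# is ignored here, so this overstates the SNN-favorable regime.
E_MAC_PJ = 4.6
E_AC_PJ = 0.9

def ann_energy_pj(synapses):
    return synapses * E_MAC_PJ  # one MAC per synapse per inference

def snn_energy_pj(synapses, timesteps, spikes_per_syn_per_step):
    # Only synapses that actually receive a spike perform an accumulate
    return synapses * timesteps * spikes_per_syn_per_step * E_AC_PJ

# At T=4 and 0.1 spikes/synapse/step: 0.36 pJ vs 4.6 pJ per synapse,
# a ~13x compute saving; above ~1.3 spikes/synapse/step at the same T,
# the SNN loses even this compute-only advantage.
```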
The absolute measurements across studies are consistent. Conventional ANN inference uses approximately 200 millijoules per inference on a reference system. A converted SNN running the same task uses approximately 20 millijoules, a factor of 10 improvement. An SNN trained directly with STDP or surrogate gradients, optimized for sparse activity, reaches approximately 5 millijoules per inference, a factor of 40 improvement. The best-optimized implementations have demonstrated 12.55 nanojoules per inference, a 99.72% reduction from a CNN baseline of 4,421.52 nanojoules on the same task.
The 30-60 times reduction in multiply-accumulate operations that sparse SNNs achieve versus ANNs translates directly into energy savings when running on hardware that only computes on spike arrival. On conventional GPUs, which compute synchronously regardless of activation sparsity, much of this theoretical advantage disappears. The hardware architecture must match the computational model for the energy benefit to be captured.
VGG-16 on CIFAR-10 and CIFAR-100 shows a 5x energy reduction using SNN implementations versus ANN baselines. Sigma-delta neuron encoding with direct input representation achieves 3x efficiency. Optimization of synaptic operations through network pruning and regularization can reduce energy by 84%, from 3.86 million synaptic operations to 0.63 million per inference.
What the Latency Numbers Show
The latency story is about timesteps. An SNN processes input over multiple timesteps: the network accumulates spikes over time before producing an output. More timesteps generally mean higher accuracy, but also more latency.
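A sketch of what multi-timestep, rate-decoded inference looks like in Python; `snn_step` here is an assumed stateful single-timestep network returning per-class spike counts, not an API from any cited framework.

```python
def snn_classify(snn_step, x, num_steps):
    """Rate-decoded SNN inference: present the input for T timesteps,
    accumulate the output layer's spikes, classify by the highest count.
    Latency and energy both scale with num_steps; accuracy usually rises
    with it, which is the trade-off the timestep numbers below describe."""
    spike_counts = None
    for _ in range(num_steps):
        out = snn_step(x)  # stateful layers carry membrane potential across steps
        spike_counts = out if spike_counts is None else spike_counts + out
    return spike_counts.argmax(dim=-1)
```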
Early ANN-to-SNN conversion methods required 640 or more timesteps for lossless accuracy preservation. At 640 timesteps, the latency advantage of a spiking network disappears and the energy cost grows proportionally with timestep count. The field recognized this as the central practical barrier to SNN deployment.
Recent work has compressed the timestep requirement dramatically. TSC-SNN achieves 92.79% on CIFAR-10 at 64 timesteps, a 3x reduction from the prior standard of 200 timesteps. On event-based datasets, timestep reductions of 1.64 to 1.95 times are achievable while maintaining accuracy, because the input data is already temporal and requires fewer processing steps. On standard image benchmarks, latency reductions of 1.76 to 2.76 times over prior methods have been demonstrated on CIFAR-10.
The theoretical endpoint is T=1: a single timestep, meaning the SNN processes input and produces output in one forward pass with no temporal accumulation. The PMSM method demonstrates 81.6% top-1 accuracy on ImageNet at T=1, about 3.5 points below the best multi-timestep SNN result and roughly 9 points below frontier ANN performance. At T=1, the latency advantage is maximal and the energy advantage is substantial because each neuron fires at most once.
The Training Gap and How It Is Closing
The accuracy gap between SNNs and ANNs reflects a training gap as much as an architectural one. ANNs have thirty years of optimization: learning rate schedules, batch normalization, dropout, residual connections, attention mechanisms. SNNs have had practical training methods for less than a decade.
Surrogate gradient descent, formalized by Zenke and Ganguli in 2018 and systematized by Neftci et al. in 2019, enabled backpropagation through spiking networks for the first time. The method is an approximation but it works well enough to close most of the accuracy gap on simpler benchmarks. The remaining gap on complex tasks reflects two things: insufficient training infrastructure and the inherent tension between temporal coding efficiency and the static nature of most benchmark datasets.
SpikingJelly (Fang et al., Science Advances 2023) addressed the infrastructure gap by providing a full-stack Python framework with 11x training acceleration, standard neuron models, hardware backends for multiple neuromorphic chips, and support for ANN-to-SNN conversion alongside direct training. Before SpikingJelly and similar tools, reproducing SNN results from papers required reimplementing the training infrastructure from scratch. The ecosystem is now comparable to early PyTorch in maturity.
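For concreteness, a minimal SpikingJelly-style network can be sketched as below. Module paths have moved between releases (older versions used `clock_driven` rather than `activation_based`), so treat the exact imports as indicative rather than authoritative.

```python
import torch
import torch.nn as nn
from spikingjelly.activation_based import neuron, surrogate, functional

# Two-layer spiking MLP: LIF neurons carry membrane state across timesteps
# and backpropagate through an arctangent surrogate gradient.
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    neuron.LIFNode(surrogate_function=surrogate.ATan()),
    nn.Linear(128, 10),
    neuron.LIFNode(surrogate_function=surrogate.ATan()),
)

x = torch.rand(8, 1, 28, 28)             # dummy batch of MNIST-sized inputs
out = sum(net(x) for _ in range(4)) / 4  # rate-decode over 4 timesteps
functional.reset_net(net)                # clear membrane state between samples
```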
ANN-to-SNN conversion remains a viable alternative to direct training for groups that want to deploy existing trained models on neuromorphic hardware. The conversion maps ANN activations to firing rates, treating the output of a spiking neuron over multiple timesteps as an approximation of the original neuron's continuous activation. Rueckauer et al.'s conversion pipeline, which pairs activation normalization with reset-by-subtraction neurons, improved conversion quality significantly, and recent work has reduced the required timestep count from hundreds to tens while preserving most of the accuracy.
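The neuron mechanics behind good conversions are compact. Here is a sketch of one integrate-and-fire timestep with reset-by-subtraction, my illustration of the mechanism rather than Rueckauer et al.'s code:

```python
import torch

def if_step_soft_reset(v, input_current, threshold=1.0):
    """One integrate-and-fire timestep with reset-by-subtraction: after a
    spike, the threshold is subtracted rather than the membrane zeroed, so
    residual charge carries over and long-run firing rates track the
    original ANN activations more faithfully than with a hard reset."""
    v = v + input_current
    spikes = (v >= threshold).to(v.dtype)
    v = v - spikes * threshold  # subtract, don't reset to zero
    return v, spikes
```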
Where This Points
The benchmarks tell a clear story about which applications SNNs are ready for now and which require continued research.
SNNs are ready for deployment on event-based sensor data. DVS cameras, silicon cochleas, tactile sensors, radar, and other sensors that produce event streams natively pair with SNNs without the encoding overhead that degrades performance on static image benchmarks. The 98%+ accuracy on N-MNIST and DVS-Gesture at power levels an order of magnitude below ANN alternatives is not a research result. It is a deployment argument.
SNNs are ready for audio and temporal pattern recognition tasks, where the inherent temporal dynamics match the structure of the data. Speech processing, keyword spotting, and anomaly detection in time series all benefit from the temporal coding that SNNs handle natively.
SNNs are approaching competitiveness on static image classification. The MSVIT result of 85.06% on ImageNet is within reach of production use cases where the energy advantage justifies a few percentage points of accuracy trade-off, particularly at the edge where a GPU alternative is not an option.
SNNs are not yet competitive for NLP and large language model tasks. The dense, synchronous computation that transformer architectures require does not map naturally to sparse, event-driven spiking computation. This may change as hybrid architectures develop, but it is not the near-term opportunity.
The hardware-software co-design opportunity is the one that most matters. The energy advantage documented in the CEA study is only accessible when the hardware computes on spike arrival rather than on a fixed clock. Running SNNs on conventional GPUs captures some computational savings but not the full hardware-level efficiency benefit. The full advantage requires neuromorphic hardware. That hardware-software pair, a well-trained sparse SNN running on event-driven neuromorphic silicon for a task where the input data is naturally temporal, is where the research results become deployment results.
The benchmarks show a technology that is competitive in specific domains now, closing the accuracy gap on general tasks rapidly, and awaiting the hardware accessibility improvements that will determine whether research performance translates into deployed systems.