
Identification Model v1.1 - Better Dataset, Better Generalization

Introduction

After shipping the v1.0.0 identification model, the next step was not to redesign the architecture, but to scrutinize the data and training pipeline that feeds it. V1.1.0 is the result of that work: a focused set of surgical improvements that, taken together, produced the biggest leap in real-world generalization the AntScout model has seen to date.

The headline number tells the story: image-only (NoLoc) accuracy on the out-of-distribution AntScout dataset went from 53.97% to 63.10%, a jump of 9.13 percentage points. For a model that already sits well above iNaturalist and BioCLIP 2 on this benchmark, that is a substantial gain.


What Changed in V1.1

1. Cleaner Dataset - Removing Non-Ant Classes

The training data sourced from AntWeb contains a broad range of arthropod imagery. Previous versions included classes such as beetles and bees that crept in alongside the ant specimens. These non-ant classes added noise to the feature space: the model had to dedicate representational capacity to insects that users will never submit for ant identification.

Fix: All non-ant insect classes (beetles, bees, and similar outliers) were purged from the dataset. This tightened the class distribution and allowed the model to focus entirely on the fine-grained morphological differences between ant species, which is exactly what ArcFace + DINOv3 excels at.
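
As a rough illustration of this kind of cleanup, the sketch below filters specimen records by taxonomic family. The `Specimen` record, its field names, and the idea of filtering on a `family` field are assumptions made for the example; the post does not say how the non-ant classes were actually identified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specimen:
    """Hypothetical training record; field names are illustrative only."""
    image_path: str
    species: str
    family: str  # taxonomic family, e.g. "Formicidae" for ants


ANT_FAMILY = "Formicidae"


def keep_ants_only(specimens):
    """Return only the records whose family is Formicidae (ants)."""
    return [s for s in specimens if s.family.strip().lower() == ANT_FAMILY.lower()]


def class_names(specimens):
    """Sorted list of distinct species labels, i.e. the classifier's classes."""
    return sorted({s.species for s in specimens})


records = [
    Specimen("img/0001.jpg", "Lasius niger", "Formicidae"),
    Specimen("img/0002.jpg", "Apis mellifera", "Apidae"),  # bee
    Specimen("img/0003.jpg", "Tetramorium caespitum", "Formicidae"),
    Specimen("img/0004.jpg", "Coccinella septempunctata", "Coccinellidae"),  # beetle
]

ants = keep_ants_only(records)
print(class_names(ants))
```

Filtering before the class list is built matters: dropping the images but leaving the labels in place would still reserve classifier outputs for insects the model never needs to predict.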


2. FiLM Conditioning Disabled

In v1.0.1, FiLM (Feature-wise Linear Modulation) parameters were generated from the location embedding and used to modulate DINOv3 visual features, with a film_scale of 0.7. The hypothesis was that telling the model where an image was taken would help it weight geographically plausible species higher even in the visual backbone itself.

In practice, the effect was the opposite for out-of-distribution images: FiLM conditioning slightly reduced NoLoc accuracy, indicating that the modulation was entangling visual and spatial features in a way that hurt pure image-based generalization.

Fix: FiLM conditioning was disabled. Location data still enters through the classification head via the location embedding, but it no longer modulates the visual feature extraction itself. The result is a cleaner separation between “what do I see” and “where was this taken.”
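
To make the mechanism concrete, here is a minimal pure-Python sketch of FiLM-style modulation with an on/off switch. The exact formula (`1 + film_scale * gamma` for the scale term), the weight matrices, and the function names are assumptions for illustration; the post only states that FiLM parameters came from the location embedding and that `film_scale` was 0.7.

```python
def linear(x, weight):
    """Multiply vector x (length n) by weight (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(len(weight[0]))]


def film_modulate(visual, loc_embedding, w_gamma, w_beta, film_scale=0.7, enabled=True):
    """Scale and shift each visual feature channel using the location embedding.

    With enabled=False the visual features pass through untouched, so location
    can only influence the prediction later, in the classification head.
    """
    if not enabled:
        return list(visual)
    gamma = linear(loc_embedding, w_gamma)  # per-channel scale offsets
    beta = linear(loc_embedding, w_beta)    # per-channel shifts
    return [v * (1.0 + film_scale * g) + film_scale * b
            for v, g, b in zip(visual, gamma, beta)]


# Toy inputs: 3 visual channels, 2-dim location embedding (hypothetical sizes).
visual = [0.5, -1.0, 2.0]
loc_embedding = [1.0, 0.0]
w_gamma = [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]
w_beta = [[0.0, 0.1, -0.1], [0.0, 0.0, 0.0]]

with_film = film_modulate(visual, loc_embedding, w_gamma, w_beta, enabled=True)
without_film = film_modulate(visual, loc_embedding, w_gamma, w_beta, enabled=False)
print(with_film, without_film)
```

With the switch off, two photos of the same ant produce identical visual features no matter where they were taken, which is the separation this section describes.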


3. Leaner Model Head - Halved Embedding and Bottleneck Dimensions

The v1.0.1 head used:

  • Embedding size: 4096-dim final feature vector
  • Bottleneck hidden dim: 8192 → 4096 wide bottleneck

These dimensions were generous but turned out to be larger than necessary for the task. A smaller head reduces parameter count, speeds up training, and can help prevent the head from memorizing training distribution features.

Fix: Both dimensions were halved:

| Parameter      | V1.0.1 | V1.1.0 |
|----------------|--------|--------|
| embedding_size | 4096   | 2048   |
| bn_hidden_dim  | 8192   | 4096   |

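With the leaner head in place, accuracy on the standard validation set held up (Loc improved, and NoLoc slipped by less than one point relative to V1.0.1), and the smaller head likely contributed to the better out-of-distribution generalization.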
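
For a sense of scale, the back-of-the-envelope sketch below counts the parameters of a two-layer head before and after the change. The layer layout (input features → bottleneck → embedding) and the 2048-dim input, taken here as a 1024-dim CLS token concatenated with 1024-dim GeM-pooled features, are assumptions; only `embedding_size` and `bn_hidden_dim` come from the post.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def head_params(feature_dim, bn_hidden_dim, embedding_size):
    """Two linear layers: features -> bottleneck -> embedding (assumed layout)."""
    return (linear_params(feature_dim, bn_hidden_dim)
            + linear_params(bn_hidden_dim, embedding_size))


# Assumption: ViT-L/16 features are 1024-dim, and CLS + GeM are concatenated.
FEATURE_DIM = 2048

old = head_params(FEATURE_DIM, bn_hidden_dim=8192, embedding_size=4096)
new = head_params(FEATURE_DIM, bn_hidden_dim=4096, embedding_size=2048)

print(f"V1.0.1 head: {old:,} parameters")
print(f"V1.1.0 head: {new:,} parameters")
print(f"ratio: {new / old:.2f}")
```

Under these assumptions the head shrinks to roughly a third of its former size, because the bottleneck-to-embedding layer scales with the product of the two halved dimensions. The ArcFace class-weight matrix, which is not counted here, would also halve with `embedding_size`.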


4. Fixed Habitat Data Format Consistency

The 40+ ecological habitat layers (BioClim, SoilGrids, land cover, topography) sampled per observation were found to have formatting inconsistencies. The source TIFF files were not all in the same format, which caused some of the habitat layers to not be read correctly. Instead of providing the correct environmental features, these mismatched layers effectively injected noise into every observation that had valid GPS coordinates.

Fix: All habitat TIFF files were standardized to the same coordinate reference system and format, ensuring that every layer is correctly parsed and matched. This made the location signal trustworthy again, which directly benefits the Localized accuracy metrics.
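
A consistency check along these lines could have flagged the problem earlier. The sketch below compares per-layer metadata against the majority value for each field. The layer names and metadata values are made up, and a real pipeline would read them from the TIFF headers (for example with a raster library) rather than from hardcoded dictionaries.

```python
from collections import Counter


def find_inconsistent_layers(layers, keys=("crs", "dtype", "nodata")):
    """Return {layer_name: {key: (found, expected)}} for layers that deviate
    from the most common value of each metadata key."""
    expected = {
        k: Counter(meta[k] for meta in layers.values()).most_common(1)[0][0]
        for k in keys
    }
    problems = {}
    for name, meta in layers.items():
        diff = {k: (meta[k], expected[k]) for k in keys if meta[k] != expected[k]}
        if diff:
            problems[name] = diff
    return problems


# Hypothetical metadata for four habitat layers.
habitat_layers = {
    "bioclim_01": {"crs": "EPSG:4326", "dtype": "float32", "nodata": -9999.0},
    "bioclim_12": {"crs": "EPSG:4326", "dtype": "float32", "nodata": -9999.0},
    "soil_ph":    {"crs": "EPSG:3857", "dtype": "float32", "nodata": -9999.0},
    "land_cover": {"crs": "EPSG:4326", "dtype": "uint8",   "nodata": -9999.0},
}

problems = find_inconsistent_layers(habitat_layers)
for name, diff in problems.items():
    print(name, diff)
```

A layer in a different coordinate reference system is the most damaging case: sampling it with latitude/longitude values returns a number, just not the one for that location, so nothing fails visibly.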


5. Improved Training Efficiency - Sharded Tars & Head-Only Warm-Up

Two infrastructure improvements sped up the training cycle significantly:

  • Sharded tars: Training data is now packed into sharded WebDataset .tar archives. This replaces random disk seeks, which were a bottleneck on the previous NVMe setup when training on 7.7M images, with fast sequential I/O.
  • Head-only retraining: Rather than training the full backbone from the start, the new run reused the previous backbone weights and retrained only the head first. This lets the model reach a sensible feature space before the full end-to-end fine-tuning phase begins, reducing wasted compute in early epochs.
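
The sharding idea can be sketched with the standard library alone. In the WebDataset convention, files inside a tar that share a basename form one sample. The shard size, file-name pattern, and the `.jpg`/`.cls` pairing below are illustrative assumptions, not the project's actual layout.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def write_shards(samples, out_dir, samples_per_shard=2, prefix="train"):
    """Pack (key, image_bytes, label) samples into numbered .tar shards.

    Each sample becomes two tar members, <key>.jpg and <key>.cls, stored
    back to back so a reader can stream them sequentially.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    for start in range(0, len(samples), samples_per_shard):
        shard_path = out_dir / f"{prefix}-{start // samples_per_shard:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for key, image_bytes, label in samples[start:start + samples_per_shard]:
                members = (("jpg", image_bytes), ("cls", str(label).encode("utf-8")))
                for ext, payload in members:
                    info = tarfile.TarInfo(name=f"{key}.{ext}")
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))
        shard_paths.append(shard_path)
    return shard_paths


# Five fake samples; real shards would hold thousands of JPEGs each.
samples = [(f"{i:08d}", b"fake-jpeg-bytes", i % 3) for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards(samples, tmp)
    shard_names = [p.name for p in shards]
    with tarfile.open(shards[0]) as tar:
        first_members = tar.getnames()

print(shard_names)
print(first_members)
```

Because each shard is read front to back, shuffling is typically done at the shard level plus a small in-memory buffer, rather than by seeking to individual files.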

Performance Results: V1.0.1 → V1.1.0

The comparison below focuses on the EMA model weights, which consistently outperform the standard weights because exponential moving averaging smooths out gradient noise.
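
For readers unfamiliar with the technique, the sketch below shows the basic exponential-moving-average update on a toy weight that oscillates around 1.0. The decay value and the flat-list representation of weights are assumptions chosen for the example; the post does not give the decay used in training.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy trajectory: a single weight bouncing between 0.9 and 1.1 from step to step.
raw_steps = [0.9, 1.1] * 10

ema = [1.0]
ema_history = []
for w in raw_steps:
    ema = ema_update(ema, [w], decay=0.9)  # small decay so the effect is visible
    ema_history.append(ema[0])

raw_spread = max(raw_steps) - min(raw_steps)
ema_spread = max(ema_history) - min(ema_history)
print(f"raw spread: {raw_spread:.3f}, EMA spread: {ema_spread:.3f}")
```

The averaged copy is only used for evaluation and inference; the optimizer keeps updating the raw weights.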


Standard Validation Set

| Metric        | V1.0.0 | V1.0.1 | V1.1.0 |
|---------------|--------|--------|--------|
| Loc (Top-1)   | 69.37% | 69.02% | 71.41% |
| NoLoc (Top-1) | 57.03% | 64.58% | 63.73% |

AntScout Dataset (Out-of-Distribution)

| Metric        | V1.0.0 | V1.0.1 | V1.1.0 |
|---------------|--------|--------|--------|
| Loc (Top-1)   | 64.37% | 61.99% | 66.93% |
| NoLoc (Top-1) | 42.18% | 53.97% | 63.10% |

This is where v1.1 truly delivers. The AntScout dataset consists of real user-submitted field photos with noisy backgrounds, variable lighting, and amateur camera angles, as opposed to the controlled iNaturalist imagery in the standard validation set. The 9.13-point NoLoc improvement on this benchmark is the clearest evidence that the model learned genuinely more robust visual features rather than overfitting to training data statistics.
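
The deltas quoted in this post are differences in percentage points, which is not the same thing as relative improvement. The short sketch below recomputes both from the two results tables; the dictionary layout and helper names are just for illustration.

```python
# Top-1 accuracy (%) copied from the two results tables.
results = {
    ("val", "loc"):        {"V1.0.0": 69.37, "V1.0.1": 69.02, "V1.1.0": 71.41},
    ("val", "noloc"):      {"V1.0.0": 57.03, "V1.0.1": 64.58, "V1.1.0": 63.73},
    ("antscout", "loc"):   {"V1.0.0": 64.37, "V1.0.1": 61.99, "V1.1.0": 66.93},
    ("antscout", "noloc"): {"V1.0.0": 42.18, "V1.0.1": 53.97, "V1.1.0": 63.10},
}


def delta_points(row, old="V1.0.1", new="V1.1.0"):
    """Absolute change in percentage points."""
    return round(row[new] - row[old], 2)


def relative_gain(row, old="V1.0.1", new="V1.1.0"):
    """Change relative to the old accuracy, in percent."""
    return round(100.0 * (row[new] - row[old]) / row[old], 1)


for (dataset, mode), row in results.items():
    print(f"{dataset:9s} {mode:6s} {delta_points(row):+.2f} pts  {relative_gain(row):+.1f}% relative")
```

On AntScout NoLoc, the 9.13-point gain corresponds to about 17% more correct identifications than V1.0.1 managed on the same images.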

[Figure: Accuracy Comparison - With Location (Loc). Panels: Val - Loc * and AntScout - Loc **; bars for V1.0.0, V1.0.1, and V1.1.0, with the values listed in the tables above.]

[Figure: Accuracy Comparison - Without Location (NoLoc). Panels: Val - NoLoc * and AntScout - NoLoc **; bars for V1.0.0, V1.0.1, and V1.1.0, with the values listed in the tables above.]

* The validation dataset has 1 image per species, so it is extremely difficult to score a high percentage on it.
** This dataset is every image from the reported flights on AntScout; it is not cleaned and contains plenty of impossible-to-identify images.

Why the AntScout NoLoc Jump Matters

The 9.13-point NoLoc gain on AntScout is the metric I care about most. Here’s why:

  • NoLoc = pure visual signal. When a user uploads a photo without GPS coordinates, the model has only the image to work with. A model that only learned to exploit location priors will collapse here. V1.1 is clearly learning what ants look like, not just where they are found.

  • AntScout is adversarial. Unlike the curated iNaturalist validation set where images are nicely framed and labelled by expert observers, the AntScout dataset reflects real-world conditions: smartphone photos, partial ants, cluttered backgrounds. Performing well here translates directly to a better user experience in the app.

  • The dataset cleanup was the root cause. The removal of beetle and bee classes reduced cross-domain confusion and forced the network to sharpen its ant-specific feature detectors. The habitat format consistency fix meant that every training signal about “ant species X lives in habitat Y” was now actually correct, giving the location head a clean signal to learn from.


Images the Model Got Correct Without Location (Unseen in Training)

Since the NoLoc (image-only) accuracy on out-of-distribution data is our most critical indicator of visual learning, here is a slideshow of real-world observations that the model correctly identified at the species level without location context.

These specific photos were completely unseen during training, meaning the model had to rely purely on fine-grained visual details (like propodeal spines, scape length, and head shape) to get them correct.


Architecture Summary: V1.0.0 vs V1.0.1 vs V1.1.0

| Component                  | V1.0.0                   | V1.0.1                   | V1.1.0                      |
|----------------------------|--------------------------|--------------------------|-----------------------------|
| Backbone                   | DINOv3 ViT-L/16          | DINOv3 ViT-L/16          | DINOv3 ViT-L/16 (unchanged) |
| Pooling                    | CLS + GeM                | CLS + GeM (unchanged)    | CLS + GeM (unchanged)       |
| FiLM Conditioning          | Enabled (film_scale=0.7) | Enabled (film_scale=0.7) | Disabled                    |
| Embedding size             | 4096                     | 4096                     | 2048                        |
| Bottleneck hidden dim      | 8192                     | 8192                     | 4096                        |
| Resolution                 | 448px                    | 512px                    | 512px                       |
| Dataset                    | Includes beetles/bees    | Includes beetles/bees    | Ants only                   |
| Habitat format consistency | Inconsistent             | Inconsistent             | Standardized                |
| Training format            | Sequential files         | Sequential files         | Sharded tars                |
| Backbone warm-up           | Unfrozen after 2 epochs  | Full retrain             | Head-only → full            |

What’s Next

V1.1 sets a strong baseline. The cleaner dataset and fixed habitat format consistency are permanent improvements that every future version will benefit from. Potential directions for V1.2 include:

  • Expanding species coverage - adding recently described or underrepresented species from newly curated sources.
  • Revisiting FiLM - with habitat data now correctly aligned and standardized, a properly re-tuned FiLM module might help localized accuracy further without hurting NoLoc.

Conclusion

V1.1 beats V1.0.0 on every reported metric and beats V1.0.1 on all but standard-validation NoLoc, which slipped from 64.58% to 63.73%. The two most impactful changes were:

  1. Removing non-ant classes from the dataset, which forced the model to develop sharper ant-specific visual features.
  2. Fixing the habitat format consistency issues, which made every location-conditioned training signal accurate for the first time.

The result is the highest standard localized accuracy the model has ever achieved (+2.39 points over V1.0.1), and a 9.13-point leap in image-only out-of-distribution accuracy on the AntScout benchmark, which is the number that matters most for real users identifying ants in the field.

V1.1 is now live in the AntScout identification tool. Try it at /tools/ant-identification.
