datasets conservation wildsight gran-chaco computer-vision

Open Datasets for Conservation AI: Why the Gran Chaco Needs Its Own Training Data

Global biodiversity datasets ignore the Gran Chaco. Here’s why that matters and what we’re doing about it.

March 26, 2026 · FDF Labs
Train a wildlife detection model on iNaturalist and deploy it in the Gran Chaco. Watch it fail. Not because the model is bad. Because the data it learned from doesn’t know this place exists.
The Gran Chaco is the second-largest forest in South America. Over 1.1 million square kilometers across Paraguay, Argentina, and Bolivia. It holds more than 500 bird species, 150 mammal species, and one of the highest deforestation rates on the planet — roughly 8% of global forest loss in the last two decades. You would think that a region this ecologically critical would be well-represented in global biodiversity databases. It isn’t.
The datasets that power conservation AI — iNaturalist, LILA BC, GBIF, Wildlife Insights — are overwhelmingly biased toward the Northern Hemisphere. North America and Europe together account for the vast majority of labeled camera trap images in public repositories. The Neotropics are underrepresented. The dry Chaco specifically is almost invisible. This matters for a simple reason: a model trained primarily on North American white-tailed deer and European red fox will not reliably detect a Chacoan peccary or a maned wolf. Not because the architecture is wrong, but because the training distribution is wrong. The model has never seen these animals, in these lighting conditions, from these camera angles, in this vegetation. Domain shift isn’t a theoretical problem. It’s the reason good models fail in new ecosystems.
We ran into this ourselves. When we started building CFI — our camera trap analysis pipeline for the Chaco — we tried the obvious approach first. Take MegaDetector, a well-known open model for camera trap image filtering, and run it on our data. MegaDetector is excellent at what it does: separating animal detections from empty frames, vehicles, and humans. But it was never designed to identify species in the dry Chaco. That isn’t a criticism of the tool. It’s a statement about training data. The model performs well on the ecosystems it was trained on. Ours isn’t one of them. The same story repeats across species classifiers. Models fine-tuned on North American or African datasets degrade quickly when confronted with South American taxa they’ve never seen. Confidence scores stay high — the model doesn’t know what it doesn’t know — but the predictions are wrong.
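To make the filtering step concrete, here is a minimal sketch of what that first pass looks like in practice. It assumes MegaDetector’s published batch-output JSON format (an `images` list, each with `detections` carrying a `category`, `conf`, and `bbox`); the 0.2 confidence threshold and the function name are our illustrative choices, not part of any tool.

```python
# MegaDetector's batch output labels detections with category "1" = animal,
# "2" = person, "3" = vehicle. The threshold below is a hypothetical choice.
ANIMAL = "1"
CONF_THRESHOLD = 0.2

def keep_animal_frames(md_output: dict) -> list[str]:
    """Return paths of images with at least one animal detection above
    the threshold; everything else is treated as an empty frame."""
    keep = []
    for image in md_output.get("images", []):
        detections = image.get("detections") or []
        if any(d["category"] == ANIMAL and d["conf"] >= CONF_THRESHOLD
               for d in detections):
            keep.append(image["file"])
    return keep

# Minimal example using the JSON shape MegaDetector emits:
sample = {
    "images": [
        {"file": "cam01/0001.jpg",
         "detections": [{"category": "1", "conf": 0.91, "bbox": [0.1, 0.2, 0.3, 0.3]}]},
        {"file": "cam01/0002.jpg", "detections": []},   # empty frame
        {"file": "cam01/0003.jpg",                      # human, filtered out
         "detections": [{"category": "2", "conf": 0.88, "bbox": [0.0, 0.0, 0.5, 0.9]}]},
    ]
}
print(keep_animal_frames(sample))  # → ['cam01/0001.jpg']
```

This is exactly the step MegaDetector handles well in any ecosystem — animal vs. not-animal. What comes after it, the species call, is where the training distribution starts to matter.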
The fix isn’t better models. It’s better data. Specifically: labeled camera trap images from the Gran Chaco, annotated by people who know the difference between a Chacoan peccary and a white-lipped peccary, between a crab-eating fox and a Pampas fox, between a juvenile tapir and an adult capybara in infrared at 2am. This is the kind of annotation that can’t be crowdsourced to a global platform. It requires regional ecological expertise. A biologist in Michigan shouldn’t be labeling Chacoan fauna any more than we should be labeling wolverines.
That’s why we built Wildsight, a controlled-access camera trap dataset from Paraguay’s Gran Chaco. Every image passes through a four-stage pipeline: detection, taxonomic classification, behavioral annotation, and structured export. The annotations are built with local scientists who work in this ecosystem, not imported from models trained elsewhere. The data covers Mammalia, Aves, and Reptilia. Output formats are designed for the tools ecologists actually use: camtrapR, Distance, and PRESENCE in R, plus CSV and JSON for Python workflows. No reformatting. No adapter scripts. Research-ready.

Access is controlled. Researchers apply with a brief project description and get approved within 48 hours. Contributors receive clear attribution, and anyone providing 15% or more of the records used in a publication is offered co-authorship.
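As a sketch of what “research-ready” means at the export stage, here is one hypothetical record after the four pipeline stages. The column names follow camtrapR’s recordTable conventions (Station, Species, DateTimeOriginal); the behavior and confidence fields, the station ID, and the function below are illustrative assumptions, not the actual Wildsight schema.

```python
import csv
import io
import json

# A hypothetical record after detection, classification, and behavioral
# annotation. Field names beyond the camtrapR-style trio are assumptions.
record = {
    "Station": "CT-042",
    "Species": "Catagonus wagneri",        # Chacoan peccary
    "DateTimeOriginal": "2025-09-14 02:13:05",
    "Count": 3,
    "Behavior": "foraging",
    "DetectionConfidence": 0.94,
}

def to_csv(records: list[dict]) -> str:
    """Flatten records to a CSV that R workflows can ingest directly."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

# The same record serializes to JSON for Python workflows — no adapter scripts.
print(to_csv([record]))
print(json.dumps(record, indent=2))
```

The point of the design is that one annotation pass yields both outputs: the CSV drops straight into an R occupancy or distance-sampling workflow, and the JSON goes straight into a Python training or analysis pipeline.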
This isn’t a data grab. The dataset is published under CC BY-NC 4.0. We build tools, not data traps. The governance model matters as much as the data itself. Conservation researchers in the Global South have been burned before — by platforms that absorbed their data, trained commercial models on it, and returned nothing. We designed Wildsight so that every contributor retains control over how their images are used, including whether they can be used for model training at all.
The broader point is this: conservation AI will only work where it has been trained. A model is only as good as the ecology represented in its training set. If the Chaco isn’t in the data, the Chaco doesn’t get the tools. And the Chaco can’t wait. Between 2001 and 2020, the Paraguayan Chaco lost approximately 27% of its forest cover. Camera traps are deployed across the region by NGOs, universities, and government agencies. The images exist. What’s missing is the infrastructure to turn them into structured, machine-readable intelligence at scale. That’s the gap. Not more cameras. Not better algorithms. Data that actually represents the ecosystem it’s supposed to protect.
If you’re working with camera trap data in the Neotropics, or building species detection models that need to generalize beyond North American and African taxa, we should talk. Wildsight is open to researchers, conservation organizations, and graduate students. Request access here.