CNG London: People Who Fight With Data for a Living

Jul 02, 2026

I enjoy events like CNG London more than most conferences, mainly because the people in the room tend to be practitioners. They build things, maintain things and occasionally break things. Or, as Jed Sundwall put it, they are “people who fight with data for a living”.

CNG is not exactly a conference. It is more organised than a meetup and, despite everyone speaking the same cloud-native dialect, apparently not a cult either. It calls itself a forum. I am not sure how much of the Roman meaning survives, but the description felt about right: a public gathering place where people presented their work and ideas, discussed their challenges and invited others to collaborate.

For anyone unfamiliar with it, the Cloud-Native Geospatial Forum is an initiative of Radiant Earth aiming to encourage the use of common cloud technologies and open data formats across the geospatial sector. The event itself was more about what happens once those ideas meet actual users, actual institutions and awkwardly large datasets.

The programme moved through GeoAI, environmental decision-making, foundation-model embeddings, Zarr, ecology, informal settlements, autonomous agents, drought monitoring, energy forecasting and humanitarian drone imagery. It sounds like a lot for one day, but much of it shared common ground around trust, scale, EO-based decision-making and what cloud-native infrastructure now makes possible.

CNG London, reimagined as a Roman forum. Illustration generated with AI.

What happens after the GeoAI demo?

Luca Budello from Innovate UK spoke about the GeoAI UK Outlook, which drew on three roundtable dinners held at the Royal Society in January 2026. Published by Innovate UK Business Connect with LunateAI and Sparkgeo, the report looks at the conditions needed for GeoAI to move from promising demonstrations into wider use.

Luca described a move from data as a service to intelligence as a service. Catalogues, APIs and cloud-optimised formats have made geospatial data much easier to find and open, but the person using them is usually trying to work out what changed, where to inspect or whether intervention is needed. The raster is only part of that process.

He compared this with open banking, where changing the rules around access to financial data allowed other organisations to build services on top of it. The UK already has strong public data institutions in Ordnance Survey, the Met Office and the Office for National Statistics, so it is easy to see the appeal of doing something similar with geospatial data. That would mean making datasets easier to use together across institutions, not simply publishing each one through a better API.

Once those datasets begin feeding recommendations or decisions, access is no longer the whole issue. The result has to make sense in the place and institution where it will be used, and someone needs to be able to trace how it was produced. A model used for agricultural monitoring does not need to meet the same standard as one used in finance, defence or public safety. Nor does validation in the UK guarantee that it will work elsewhere. Landscapes and settlement patterns differ from place to place, as do administrative systems and the biases present in the available data.

Seen in that context, “human in the loop” sounded too simple. It brings to mind someone appearing at the end of a process, reviewing an output and clicking approve. In practice, someone has to set the limits of the system and investigate results that do not look right. They also need a record of the model and data versions, what was run, in which order and where a person intervened. A confirmation button means little if nobody can reconstruct what happened before it appeared.
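To make that record a little more concrete, here is a minimal sketch of what an audit trail for a single result might hold. It is my own illustration, not something shown in the talk: the class names, field names, the "burn-model-0.3" version string and the example steps are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Step:
    """One action taken while producing a result."""
    order: int       # position in the sequence of actions
    action: str      # e.g. "select_dataset", "run_model", "human_review"
    detail: str      # free-text description of what happened
    actor: str       # "system", or the role of the person who intervened
    timestamp: str   # UTC time in ISO 8601 format


@dataclass
class RunRecord:
    """Audit trail for one result (all names here are hypothetical)."""
    model_version: str
    data_versions: dict[str, str]
    steps: list[Step] = field(default_factory=list)

    def log(self, action: str, detail: str, actor: str = "system") -> None:
        # Steps are numbered in the order they are logged.
        self.steps.append(Step(
            order=len(self.steps) + 1,
            action=action,
            detail=detail,
            actor=actor,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))

    def human_interventions(self) -> list[Step]:
        # Every step that was not taken by the system itself.
        return [s for s in self.steps if s.actor != "system"]


# Hypothetical run: version strings and step descriptions are made up.
record = RunRecord(model_version="burn-model-0.3", data_versions={"optical": "v2"})
record.log("select_dataset", "optical v2 chosen for the requested dates")
record.log("run_model", "burned-area estimate computed")
record.log("human_review", "estimate flagged: cloud shadow suspected", actor="analyst")
```

With something like this in place, a reviewer can list the steps in order and see exactly where a person stepped in, instead of only seeing the final approval.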

One point I took from the discussion was that deterministic workflows do not always fit decisions built around uncertain or probabilistic outputs. A burned-area estimate, flood-depth calculation or vegetation-change product should still be reproducible. What may vary is how the system reaches that calculation. It might interpret the request differently, select another dataset or use the available tools in a different order.

Validating a model or documenting one processing chain does not explain a result if the route through the system can change between runs. Someone reviewing the recommendation needs to be able to see how that particular result was produced.

The UK’s Sovereign AI initiative could help projects move beyond demonstrations by funding compute and infrastructure. Adoption still means paying for deployment, fitting the system into an organisation, maintaining it and deciding who takes responsibility for its use.

Jed remarked that the main use of geospatial information by government is land use. Planning, housing, transport, agriculture, conservation and energy all involve decisions about what can happen in a particular place. For all the discussion of national AI infrastructure, many of its eventual uses may still come down to a piece of land.

Working backwards from the outcome

Noelia Jiménez Martínez and Glen Low from Earth Genome presented Earth Index, a tool for searching satellite imagery through Earth-observation embeddings.

Earth Genome is a nonprofit, but its way of working sounded closer to a small technology company than to a conventional research organisation. Paid projects support some of the work, while philanthropic funding gives it room to develop tools and test new applications. Most people in the room seemed to know the organisation already.

They demonstrated the use of foundation-model embeddings for applications including seagrass mapping, although I was more interested in how they decided what to build in the first place.

Much EO development follows a familiar sequence:

data → tools → decisions → outcomes

Earth Genome works backwards:

outcomes → decisions → tools → data

The starting point is a particular user and the outcome they are trying to achieve. From there, the team works back through the decision that might lead to it, the tool needed to support that decision and, finally, the data required.

In practice, EO projects often start with a dataset, a model someone wants to test or the requirements of a funding call. Work may already be well advanced before anyone asks who is expected to use the result.

They used personas and user stories to make the intended user more specific. A conservation officer and a municipal planner may look at the same landscape, but they are unlikely to ask the same question or use the answer in the same way. They may also have different tolerances for uncertainty and very different consequences if the product is wrong. Some can act on the result directly; others can only use it to support a recommendation.

Outcomes, they argued, should be measurable. The decision should become better, faster or cheaper, and the improvement should be visible beyond the technical performance of the model. That makes cost and value part of the design from the beginning, including what the product costs to produce, how much interpretation it saves and whether it allows the user to do something that was previously too slow, difficult or expensive.

Freshwater products came up as an example of the limits of global coverage. A product may cover the whole world and still be too coarse, poorly calibrated or too detached from local practice to support a decision in a particular place.

Earth Genome’s approach was to begin with well-understood local examples and then test how far they could be generalised. Embeddings can help locate similar conditions elsewhere and reduce the amount of labelled data needed in a new region, but that still does not remove the need to understand the geography and the conditions under which the original examples were collected.

In the seagrass work, AlphaEarth Foundations embeddings performed better than the other foundation-model embeddings they tested. Their interpretation was that its pretraining data included benthic environments, making the embeddings a better fit for the task.

Satellite embeddings and ground observations

Anil Madhavapeddy presented TESSERA, a global set of 10-metre, pixel-wise Earth-observation embeddings with open weights and cloud-native access through Zarr v3 and STAC.

Instead of returning to the source imagery for every task, users can work with a reusable numerical representation of each location. That can reduce the data and computation needed for searches, classification and other analyses. What can be learned from those embeddings still depends on the information retained from the original satellite observations.
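As a rough illustration of what searching a set of embeddings can look like, the sketch below ranks locations by cosine similarity to a query location. This is an assumption on my part: cosine similarity is a common choice for comparing embeddings, but I am not claiming it is what TESSERA or Earth Index use. The location IDs and the four-dimensional vectors are made up, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings keyed by hypothetical location IDs.
embeddings = {
    "site_a": [0.9, 0.1, 0.0, 0.2],
    "site_b": [0.8, 0.2, 0.1, 0.3],
    "site_c": [0.0, 0.9, 0.7, 0.1],
}


def most_similar(query_id, embeddings, k=2):
    """Return the k locations most similar to query_id, best first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


ranking = most_similar("site_a", embeddings)
```

Here site_b ranks well ahead of site_c for a query on site_a. The comparison only touches the stored vectors, which is why the source imagery does not need to be reopened for each search.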

Molly Blank’s talk on bringing ground-level ecological data into the cloud added another side to this. Expert surveys can provide reliable observations, although they are expensive and rarely cover large areas consistently. They may also produce only a modest number of records and, as Molly pointed out, do not necessarily need a cloud platform. Sometimes a CSV is fine.

Citizen-science programmes produce many more records, though the observations reflect where people go, what they recognise and what they find interesting enough to record. GeoParquet can make those data easier to store and query without changing the way they were collected.

Camera traps, acoustic devices, environmental DNA systems and other sensors add another kind of data. They can generate observations continuously and at much larger volumes, often faster than the recordings can be processed, labelled or interpreted. Larger volumes do not necessarily make it easier to tell whether an ecosystem has changed or why.

Molly introduced the idea of an ecosystem state vector as a way of bringing some of these observations together. She asked whether embeddings derived from ground data could help calibrate remote sensing, and whether combining different modalities could reveal meaningful ecological change. Possible applications included detecting invasive species sooner and observing changes in the biotic elements of soil, which may be invisible to satellites but could complement remote sensing.

Zarr chunks, building footprints and browser maps

Sol Cotton from Open Climate Fix used the problem of Zarr chunking to show how the structure of a dataset shapes the way it can be used. He compared it with transport to a World Cup stadium. If access is designed mainly for people arriving by car, congestion and expensive taxis follow, while anyone travelling another way has to work around a system that was not designed for them.

A dataset can sit in object storage and still be painfully slow if its chunks do not match the way people query it. A layout that works well for time series may perform badly when someone requests a spatial slice, and at large scale the cost of that decision is repeated across every request.
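A small back-of-the-envelope sketch shows how much the layout can matter. It counts how many chunks a query overlaps under two chunk shapes. The array size, chunk shapes and queries are hypothetical numbers I picked for illustration, not figures from the talk.

```python
def chunks_touched(selection, chunk_shape):
    """Count chunks overlapped by a selection.

    selection: one (start, stop) pair per axis, stop exclusive.
    chunk_shape: chunk length along each axis.
    """
    total = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first_chunk = start // size
        last_chunk = (stop - 1) // size
        total *= last_chunk - first_chunk + 1
    return total


# Hypothetical cube with axes (time, y, x) = (365, 1000, 1000).
time_series_layout = (365, 10, 10)    # long in time, small in space
spatial_layout = (1, 1000, 1000)      # one full map per time step

# Query 1: the full time series for a single pixel.
pixel_series = ((0, 365), (500, 501), (500, 501))
# Query 2: one full map for a single day.
single_day_map = ((100, 101), (0, 1000), (0, 1000))

series_on_time_layout = chunks_touched(pixel_series, time_series_layout)
series_on_spatial_layout = chunks_touched(pixel_series, spatial_layout)
map_on_time_layout = chunks_touched(single_day_map, time_series_layout)
map_on_spatial_layout = chunks_touched(single_day_map, spatial_layout)
```

Under these assumptions the pixel time series touches 1 chunk in the time-oriented layout and 365 in the spatial one, while the single-day map touches 10,000 chunks in the time-oriented layout and 1 in the spatial one. Each chunk touched is typically at least one request to object storage, which is where the repeated cost comes from.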

Icechunk adds versioning and safer concurrent updates to Zarr, which is useful when several processes update the same data cube and the dataset is maintained over time.

Nissim Lebovits then presented Barrios Visibles, which combines Argentina’s official register of informal settlements with global building footprints. The buildings are used to estimate population and examine flood exposure.

ReNaBaP records an estimated 4.15 million people living in informal settlements, while the building-based method produced an estimate of about 7.59 million. The difference is around 3.4 million people not represented in the official figure.
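For anyone who wants to check the arithmetic, here is a short sketch using only the two figures quoted above. The ratio and the share are my own derived numbers, not ones presented in the talk.

```python
# Figures quoted above, in millions of people.
register_estimate = 4.15   # ReNaBaP
building_estimate = 7.59   # building-based method

# Absolute gap between the two estimates.
gap = building_estimate - register_estimate

# How many times larger the building-based estimate is.
ratio = building_estimate / register_estimate

# Share of the building-based estimate missing from the register.
missing_share = gap / building_estimate

print(f"gap: {gap:.2f} million")
print(f"ratio: {ratio:.2f}x")
print(f"share not in register: {missing_share:.1%}")
```

The gap comes out at 3.44 million, which is the "around 3.4 million" above. Put another way, the building-based estimate is roughly 1.8 times the official one.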

In this case, the global building data revealed a gap in the national register that would otherwise have been difficult to quantify. Nissim described it as an example of a global dataset outperforming the local record available for a specific task, which is not the same as outperforming local knowledge.

The project joined cloud-hosted datasets without first downloading and rebuilding them locally, then served the results in the browser through PMTiles.

Nissim joked that people cluster, like most geospatial things. Much demographic analysis still assumes that populations are distributed evenly within administrative boundaries.

He also pointed to poor internet connectivity in Argentina. Even so, with the building footprints hosted on Source Cooperative, he could access building-footprint coverage for the whole country within seconds.

When the developer is an agent

Stefan Amberger from Tilebox spoke about lessons from rebuilding an EO compute platform for machine-driven workflows.

The platform is organised around tasks and executions, with task status, logs, traces and runner context available throughout the workflow. These details become especially important when the system is being used by an agent, since the agent needs to inspect what is running and what happened during an execution.

One of Stefan’s messages was to “iterate fast, not perfect”. The emphasis was on making the platform usable by agents and improving it through actual workflows, with enough logging and execution context to follow what the system was doing.

Lightning talks

The lightning talks covered supply-chain analytics, drought mapping, Antarctic ice dynamics and open drone imagery for disaster response.

Jake Wilkins from Epoch presented a pipeline for plot-level supply-chain analytics, where global arrays are populated incrementally as requests arrive instead of being fully computed in advance. The approach seemed well suited to supply-chain work, where requests are often specific to a commodity, location or client and computing every possible result globally would be wasteful. Results filled in for one request remain available to later requests for the same area or commodity.
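The general pattern of filling an array on demand can be sketched in a few lines. This is my own simplified illustration, not Epoch's implementation: an in-memory dictionary stands in for a persistent array store, and the class name, tile IDs and placeholder analysis function are hypothetical.

```python
class LazyGrid:
    """Grid of tiles computed on first request and reused afterwards."""

    def __init__(self, compute_tile):
        self._compute_tile = compute_tile
        self._tiles = {}          # stands in for a persistent array store
        self.computations = 0     # how many tiles were actually computed

    def get(self, tile_id):
        # Compute the tile only if no earlier request has filled it.
        if tile_id not in self._tiles:
            self._tiles[tile_id] = self._compute_tile(tile_id)
            self.computations += 1
        return self._tiles[tile_id]


def placeholder_analysis(tile_id):
    """Stand-in for an expensive per-tile computation."""
    row, col = tile_id
    return row * 10 + col


grid = LazyGrid(placeholder_analysis)
first = grid.get((3, 4))    # computed now
second = grid.get((3, 4))   # served from the stored result
other = grid.get((0, 1))    # a different tile, computed now
```

After three requests only two tiles have been computed, and every tile nobody asked for has cost nothing.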

Alper Dincer followed with a global drought map built using H3, GeoParquet and DuckDB. DuckDB can read the GeoParquet files directly from object storage and run the spatial queries in process. With H3, some spatial aggregation becomes an ordinary GROUP BY, while DuckDB-WASM allows part of the analysis to run in the browser. Very little infrastructure was needed. The analysis could read files where they were stored and run without a separate spatial database or processing cluster.
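The GROUP BY point is easy to show in miniature. In the sketch below, Python's built-in sqlite3 module stands in for DuckDB purely so the example runs without installing anything, and the data sit in an in-memory table where the talk's approach reads GeoParquet from object storage. The cell IDs are placeholder strings, not real H3 indexes, and the drought-index values are made up.

```python
import sqlite3

# Each observation already carries the ID of the grid cell it falls in.
rows = [
    ("cell_a", 0.2),
    ("cell_a", 0.4),
    ("cell_b", 0.9),
    ("cell_b", 0.7),
    ("cell_c", 0.5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (cell TEXT, drought_index REAL)")
conn.executemany("INSERT INTO observations VALUES (?, ?)", rows)

# Spatial aggregation reduces to grouping on the cell ID column:
# no geometry operations or spatial joins are involved.
result = conn.execute(
    """
    SELECT cell, AVG(drought_index) AS mean_index, COUNT(*) AS n
    FROM observations
    GROUP BY cell
    ORDER BY cell
    """
).fetchall()
conn.close()
```

Because every point has been assigned to a cell in advance, the aggregation step needs nothing spatial at query time, which is part of why so little infrastructure is required.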

Ross Slater from the University of Leeds described the processing of around 150,000 Sentinel-1 image pairs for Antarctic ice-velocity mapping. The work ran on university HPC and produced a 53-terabyte Zarr cube. VirtualiZarr and Zarr v3 sharding allowed users to access subsets without copying or rebuilding the full dataset.

Petya Kangalova from the Humanitarian OpenStreetMap Team spoke about open drone imagery during disasters. The Drone Tasking Manager is used to coordinate flights by local drone pilots, while OpenAerialMap provides somewhere to process, publish and share the imagery. Local pilots can cover places that satellites miss because of timing, cloud or resolution, then make the imagery available to people mapping damage and organising the response.

Petya reduced the approach to a simple equation:

open standards + cloud-native infrastructure + community = humanitarian impact

Local pilots also bring the access and knowledge needed to decide where to fly and what matters during the response.

More is different

The closing panel, More is Different, brought together David Eaves, Jack Kelly, Niall Robinson and Kaja Wasik to discuss how technologies operating at very large scales are changing science and policy.

The line “quantity has a quality all its own” appeared early, attributed, with some uncertainty, to… Stalin. My notes include a laughing emoji, which probably captures the room’s response better than I can now.

The panel described 2022 as a turning point for weather forecasting, when machine-learning forecasts began matching or outperforming conventional numerical models. Five years earlier, much of the weather community had not expected this to happen.

Jack Kelly spoke about Dynamical’s work with weather data produced by organisations such as the Met Office. Using shared infrastructure built around Icechunk, a user could lazily open close to a petabyte of data with one line of Python, without downloading the full archive.

Infrastructure at that scale still needs an owner. Someone has to fund it, maintain it and decide which versions of the data are authoritative. Those responsibilities become more difficult when services built on top of it can reach several countries, or even billions of people, before any shared governance arrangements exist.

States had shown relatively little interest in the internet, Facebook and cloud infrastructure, according to the discussion. They care now. That involvement may be welcome or unwelcome depending on the state and the decision, but it is now difficult to discuss data infrastructure without considering the role of government.

Niall Robinson noted that when people use a Met Office forecast, they are also relying on the authority of the institution behind it. A private company might reproduce the technical pipeline without inheriting the same public trust.

David Eaves said that some sources of truth belong to government, at least in the UK. The claim was about institutional responsibility rather than infallibility. Public records need an organisation that defines and maintains them, corrects them when necessary and remains answerable for their use in public decisions.

Jed closed the discussion with an open question: “What are the planetary-scale data institutions?”

I did not write down whether anyone answered it, but for me the question is mostly about continuity and whether a usable archive remains. Landsat is the obvious example. Its record now stretches back more than fifty years, and those observations are still available, calibrated and documented well enough to be used today. That did not happen simply because the satellites collected the data. Institutions kept the archive, reprocessed it and made sure later users could still understand what they were looking at.

WMO coordinates weather observations, Copernicus runs long-term environmental services, and organisations such as CEOS and the Group on Earth Observations (GEO) help agencies and governments agree on standards and priorities. A mission may be operated by one agency, while its calibration, derived products and archive are maintained elsewhere.

Ten or twenty years later, can someone still find the data, understand how they were produced and use them without depending on a project or service that no longer exists?

A very 2026 set of problems

During the final discussion, I kept thinking how strange much of the day would have sounded ten years ago. We were talking about global benchmarks for energy forecasting, opening petabyte-scale archives without keeping local copies, reading EO data from object storage and putting together a dashboard in an afternoon.

Once data can move this easily, it also becomes easier to lose track of it. One version ends up in a model, another in a dashboard, and the project that produced the data may be over by the time someone asks how they were made or whether they are still maintained. Standards and governance came up throughout the day partly because the technical possibilities have moved so quickly.

Barrios Visibles and HOT showed how uneven that progress remains. Poor connectivity and limited local infrastructure made lightweight browser delivery important for Barrios Visibles. HOT’s disaster-response work relied on open imagery and tools that local pilots and mapping communities could use without much supporting infrastructure. Global datasets may help where local records are incomplete. They do not provide the missing census or the people needed to keep a system running.

Energy-forecasting papers often use different datasets, periods and metrics. Shared benchmarks would make the results easier to compare. Whoever designs the benchmark still decides which regions are represented, which timescales matter and which errors receive the most attention.

The word commoditise came up several times, especially around cloud infrastructure. Cloud providers were compared to electricity suppliers: once the plugs are the same, changing provider should become fairly routine. Object storage is starting to move in that direction. S3 began as an AWS interface, but it is now supported by storage systems far beyond AWS, so many of the same tools can work across providers with little adjustment.

Pricing, data-transfer costs and implementation details still vary between providers. S3 has become something close to a standard because so much of the ecosystem supports its interface.

Leaving the forum

Many people in the room had been riding the AI wave, keeping up with new capabilities and trying to work out what was worth building. Practitioners, builders, doers. The room also included people maintaining the open-source software much of this work depends on, and others working on standards, governance and adoption while the sector changes quickly.

HOT gave that energy a clear purpose. Open imagery, local pilots and lightweight tools came together to support disaster response.

I am less certain about where all the other technical energy will end up. We can build very quickly, but we can also produce overlapping tools, temporary platforms and data products that nobody has agreed to maintain. At some point, more activity can simply make the sector harder to navigate.

Talking to people about what they are working on always leaves me with new ideas and more motivation. I still enjoy the details of building things, but I find myself paying more attention to the sector around them: what people are adopting, what is missing and what will still be here a few years from now.

References

The GeoAI UK Outlook: As part of the Innovate UK GeoAI Festival [link]
Codex Planetarius Pilot Study: Palm in Paradise - A Supply Shed Analysis of Indonesian Palm Oil [link]
Awesome DuckDB Spatial [link]
A Serverless Approach to Building Planetary-Scale EO Datacubes in Zarr [link]
Designing a data pipeline for our highest-resolution dataset yet [link]

Spectral Reflectance
