TDX/AI

AI Readiness for Transport Data: Making the World's Transport Data Agent-Accessible

The shift from human-facing portals to AI-agent-facing data services that provide insights for human decision-making


From Human Portals to Agent Interfaces

Every transport data portal in existence was built for humans. A human clicks through a website, selects filters from dropdowns, downloads a CSV, opens it in Excel, squints at columns. Some portals expose APIs, but these are buried behind documentation pages, require bespoke integration work, and were designed for developers building specific applications, not for autonomous AI agents that need to discover, understand, and query data on the fly.

The world has changed. AI agents are becoming the primary consumers of data. Not replacing humans, but acting on their behalf. A policymaker asks a question in plain language; an AI agent figures out which datasets to query, how to query them, what to combine, and returns a synthesised answer. The human never touches a portal.

This changes how data should be served. Not what data we collect — the data is the data. But how we present it, describe it, and make it discoverable for machine consumption.

The concept: build an AI service layer for transport data that makes existing data portals and datasets agent-accessible, without replacing them.

This is not about building a chatbot. It is about building the infrastructure that makes any chatbot, any AI workflow, any automated analysis pipeline able to tap into global transport data. The chatbot is one application. The service layer is the platform.


What AI Readiness Means in Practice

An AI agent trying to answer "What is the climate risk to road corridors in East Africa?" today faces this reality:

  1. It does not know what datasets exist about road corridors in Africa
  2. It does not know that OPSIS has a global infrastructure resilience API
  3. Even if it did, it does not know the API schema, query parameters, or what fields mean
  4. It does not know that PortWatch has related trade disruption data accessible via ArcGIS REST
  5. It cannot combine a geospatial bounding-box query with a statistical indicator query without understanding both query paradigms
  6. It has no way to verify that it is using the right data at the right granularity

AI readiness means solving all six of these problems at the data serving layer, so that any agent — regardless of who built it — can discover, understand, query, and combine transport data.

The Three Layers of AI Readiness

Layer 1: DISCOVERY
  "What data exists? What can it tell me?"
  Metadata that lets agents find relevant datasets
  for any question, without prior knowledge.

Layer 2: COMPREHENSION
  "What does this data look like? What do the fields mean?"
  Schema descriptions, semantic annotations, and
  query pattern documentation — written for machines.

Layer 3: ACCESS
  "How do I get the data I need?"
  Standardised, thin query interfaces that agents
  can call with consistent patterns.

Layer 1: Discovery — Telling Agents What Exists

Today, a human discovers transport data by browsing portals, reading reports, or asking colleagues. An AI agent needs structured, machine-readable metadata that answers:

  • What datasets exist and what topics/geographies/modes they cover
  • What questions each dataset can answer (not just field names — semantic capability)
  • How fresh the data is and how often it updates
  • What format and access method each dataset uses
  • What other datasets it can be combined with and how

What This Looks Like

A data registry — a lightweight, structured catalogue that agents consult before making any query. Think of it as the table of contents an agent reads before deciding which chapters to open.

# Example registry entry
- id: opsis-global-infrastructure
  name: "OPSIS Global Infrastructure Resilience"
  provider: "University of Oxford"
  description: "Global road, rail, air, and maritime infrastructure networks with climate risk exposure analysis"
  capabilities:
    - "Infrastructure network topology for any country"
    - "Climate hazard exposure for transport corridors"
    - "Multi-modal connectivity analysis"
  geographic_coverage: "Global"
  transport_modes: ["road", "rail", "air", "maritime"]
  data_types: ["geospatial", "network", "risk-indicators"]
  update_frequency: "Quarterly"
  access_method: "rest_api"
  api_base: "https://global.infrastructureresilience.org/api/v1"
  combinable_with:
    - id: portwatch
      relationship: "Port disruptions can be overlaid on OPSIS maritime network nodes"
    - id: world-bank-indicators
      relationship: "Country-level transport indicators can contextualise infrastructure data"
  license: "ODbL + CC-BY-4.0"

This is not a new idea — it is what STAC does for earth observation, what CKAN does for open data portals, what a tools manifest does for an AI agent. But it does not exist for global transport data in an agent-friendly form.
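
To make this concrete, here is a minimal sketch of an agent-side registry lookup; the file name is hypothetical and the fields follow the example entry above (Python, requires PyYAML):

# Sketch: agent-side registry lookup
import yaml

def load_registry(path: str) -> list[dict]:
    """Load the registry file into a list of dataset entries."""
    with open(path) as f:
        return yaml.safe_load(f)

def find_sources(registry: list[dict], *, mode: str | None = None,
                 data_type: str | None = None) -> list[dict]:
    """Return entries matching a transport mode and/or data type."""
    matches = []
    for entry in registry:
        if mode and mode not in entry.get("transport_modes", []):
            continue
        if data_type and data_type not in entry.get("data_types", []):
            continue
        matches.append(entry)
    return matches

registry = load_registry("registry.yaml")  # hypothetical file
for entry in find_sources(registry, mode="maritime", data_type="geospatial"):
    print(entry["id"], "->", entry["api_base"])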

Skills and MCP Servers

In current AI agent tooling, capabilities are delivered in two complementary forms:

Skills are packaged units of knowledge and behaviour that an agent can be given. A skill bundles together the context an agent needs — what a data source contains, how to query it, what the fields mean, what patterns to follow — so the agent can act competently on a domain it has never seen before. Skills can be as simple as a prompt with structured instructions, or as rich as a multi-step workflow that guides an agent through discovery, querying, and synthesis. Each transport data source becomes a skill an agent can pick up.

MCP servers (Model Context Protocol) are the programmatic counterpart — live services that expose tools an agent can call. An MCP server wraps an API and presents it as a set of typed functions: query this dataset, resolve this geography, fetch these indicators. OpenAI function calling schemas and Anthropic tool definitions serve the same purpose in their respective ecosystems.

These two forms work together. A skill tells the agent what to do and why. An MCP server gives it the tools to do it. A skill might say "to answer questions about port disruptions, use the PortWatch tool with a spatial filter for the region of interest." The MCP server provides the query_port_disruptions function the agent calls.
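
As a sketch, the tool definition behind query_port_disruptions might look like this in an Anthropic-style tool schema; the parameter set is illustrative:

# Sketch: illustrative tool definition for query_port_disruptions
query_port_disruptions_tool = {
    "name": "query_port_disruptions",
    "description": ("Return recent port disruption events from PortWatch, "
                    "filtered by a WGS84 bounding box and optional date range."),
    "input_schema": {
        "type": "object",
        "properties": {
            "bbox": {
                "type": "array", "items": {"type": "number"},
                "minItems": 4, "maxItems": 4,
                "description": "[min_lon, min_lat, max_lon, max_lat] in WGS84",
            },
            "start_date": {"type": "string", "format": "date"},
            "end_date": {"type": "string", "format": "date"},
        },
        "required": ["bbox"],
    },
}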

The discovery layer should be expressible in both forms: skills that any agent can learn from, and tool definitions that any major AI framework can consume directly. Some sources need only a skill (the underlying API is simple enough for the agent to call directly with guidance). Others need an MCP server (the underlying API is too complex or requires translation). Many benefit from both.


Layer 2: Comprehension — Progressive Discovery

Discovery tells the agent what exists. Comprehension tells it what the data means. But comprehension does not need to happen all at once, and it does not require us to build elaborate metadata schemas from scratch.

Most APIs already describe themselves — OpenAPI specs, JSON Schema, GraphQL introspection, OGC capabilities documents, SDMX structure definitions. The problem is not that the information does not exist. It is that it is scattered, inconsistent, and not presented in a way an agent can consume progressively.

Let the APIs Speak for Themselves

Rather than inventing a new metadata format layered on top of every source, the comprehension layer works through progressive disclosure:

Level 1 — Tool description. The MCP tool definition or skill already carries a natural-language description of what the source does and typed parameter signatures. This is often enough for an agent to make a first call. A well-written tool description is schema documentation.

Level 2 — Contract exposure. For sources with OpenAPI specs, JSON Schema, or GraphQL schemas, expose the existing contract directly. The agent can inspect parameter types, enum values, and response shapes on demand — no translation needed. For sources without machine-readable contracts, a minimal description in the skill covers the gap.

Level 3 — Semantic hints. Where field names are ambiguous or join relationships are non-obvious, add lightweight annotations: what country_iso3 means, that OPSIS spatial data uses WGS84, that World Bank and OPSIS can be joined on country code. These are hints in the skill or tool descriptions — not a parallel metadata system.

Level 4 — Examples. A few concrete query/response examples often teach an agent more than any amount of abstract schema. Skills can embed these directly.
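
As an illustration of Level 2, an agent can pull a source's existing OpenAPI contract and inspect endpoint parameters on demand; the spec path here is an assumption about the OPSIS API:

# Sketch: Level 2 contract exposure via an existing OpenAPI spec
import requests

SPEC_URL = "https://global.infrastructureresilience.org/api/v1/openapi.json"  # assumed path
spec = requests.get(SPEC_URL, timeout=30).json()

# List each operation's query parameters straight from the contract
for path, methods in spec.get("paths", {}).items():
    for method, op in methods.items():
        if not isinstance(op, dict):
            continue
        params = [p.get("name") for p in op.get("parameters", [])]
        print(f"{method.upper()} {path}: {params}")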

What We Do NOT Do

  • We do not build a bespoke YAML schema for every dataset
  • We do not replicate information that already exists in API contracts
  • We do not require every source to conform to a single metadata standard
  • We do not front-load all comprehension before the agent can act

The principle is: start with what the source already provides, fill gaps with lightweight annotations in skills and tool descriptions, and let the agent learn progressively through interaction. An agent that calls an endpoint and inspects the response learns the schema faster than one that reads a 200-line YAML definition.


Layer 3: Access — Skills and Servers Across a Spectrum

The question for each data source is not "how do we wrap it?" but "what does an agent need to use it?" The answer varies enormously — and that is fine. Not every source needs the same treatment.

The Access Spectrum

The insight from the skills + MCP servers model is that access is a spectrum, not a single architecture. Each source sits somewhere on it:

Skill only — the API is already good enough. Sources like the World Bank REST API or OPSIS return clean JSON, have sensible query parameters, and are well-documented. An agent given a skill that explains what the source contains, how to construct queries, and what the response looks like can call these APIs directly. No middleware needed. The skill is the access layer.

Skill + light guidance — the API is usable but quirky. GraphQL endpoints (Transitland), OGC API Features, or APIs with unusual pagination or auth patterns. The agent can call them directly, but the skill needs to teach specific patterns: how to structure a GraphQL query, how to handle cursor-based pagination, how to pass an API key. Still no server — just a smarter skill.

MCP server — the API is hostile to agents. ArcGIS REST services (deeply nested query model, Esri-specific spatial encoding), SDMX (complex dimension-based query syntax, XML responses), legacy OGC WFS/WMS (XML capabilities negotiation, GML responses). These genuinely need a server that translates between the source's native interface and something an agent can call. The MCP server exposes simple typed tools; the complexity stays behind it.

MCP server + ingestion — there is no API at all. Static downloads on Zenodo, CSV dumps, PDF reports. These need pre-ingestion into a queryable store, with an MCP server in front. This is the most infrastructure-heavy option and should be used sparingly.
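
A minimal sketch of one such server, using the FastMCP helper from the official Python MCP SDK; the feature-layer URL is a placeholder, and the query parameters are standard ArcGIS REST feature-layer options:

# Sketch: MCP server translating a bbox query into ArcGIS REST
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("portwatch")
ARCGIS_LAYER = "https://example.arcgis.com/.../FeatureServer/0"  # placeholder

@mcp.tool()
def query_port_disruptions(min_lon: float, min_lat: float,
                           max_lon: float, max_lat: float) -> dict:
    """Return PortWatch disruption features inside a WGS84 bounding box."""
    params = {
        "where": "1=1",                          # no attribute filter
        "geometry": f"{min_lon},{min_lat},{max_lon},{max_lat}",
        "geometryType": "esriGeometryEnvelope",
        "inSR": "4326",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "*",
        "f": "geojson",                          # ask for GeoJSON output directly
    }
    resp = requests.get(f"{ARCGIS_LAYER}/query", params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    mcp.run()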

What This Means in Practice

Source                 API Type          Access Approach
World Bank             REST/JSON         Skill only — agent calls API directly
OPSIS                  REST/JSON         Skill only — agent calls API directly
OGC API Features       REST/JSON         Skill with query pattern guidance
Transitland            GraphQL           Skill with GraphQL examples
PortWatch              ArcGIS REST       MCP server — translates spatial queries
OECD/ITF               SDMX              MCP server — translates dimension queries to simple REST
Overture Maps          S3/Parquet        MCP server — DuckDB queries behind simple tools
African Transport DB   Static download   MCP server + ingestion

The ratio matters: most sources need a skill, not a server. This keeps the infrastructure footprint small and puts the intelligence in the agent's understanding of the domain, not in middleware.

The Agent Decides

We do not impose a uniform interface across all sources. Instead, each skill or MCP server follows the conventions natural to its domain. JSON and GeoJSON are preferred where possible, but an agent equipped with the right skill can handle a GraphQL response or a paginated REST API without everything being normalised into a single pattern.

The consistency comes from the skills, not from the plumbing. An agent with skills for five different transport data sources has a consistent mental model — discovery, comprehension, query — even though the underlying calls are different.


Key Considerations: Regular APIs and Geospatial APIs Working Together

Transport data lives in two fundamentally different worlds — tabular/statistical data and geospatial data — and an AI agent needs to work across both in a single reasoning flow.

How Regular (Tabular) APIs Work

Tabular APIs (World Bank, OECD/SDMX, IATI) deal in indicators, time series, and categorical data:

  • Query model: Filter by dimension (country, year, indicator code), get back rows of values
  • Spatial reference: Country or region codes (ISO 3166, UN M.49) — abstract geographies, not coordinates
  • Response size: Typically small (hundreds to thousands of rows)
  • An agent can: "Get road fatality rates for East African countries, 2020-2025" and receive a clean table
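
The query in that last bullet maps onto the World Bank API roughly as follows; the indicator code is illustrative and should be verified against the World Bank catalogue:

# Sketch: tabular indicator query against the World Bank API
import requests

url = ("https://api.worldbank.org/v2/country/KEN;TZA;UGA;ETH/"
       "indicator/SH.STA.TRAF.P5")              # road traffic mortality (verify code)
resp = requests.get(url, params={"format": "json", "date": "2020:2025",
                                 "per_page": 200}, timeout=30)
meta, rows = resp.json()                         # World Bank returns [metadata, rows]

for row in rows or []:
    print(row["country"]["value"], row["date"], row["value"])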

How Geospatial APIs Work

Geospatial APIs (OPSIS, PortWatch ArcGIS, OGC API Features, Overture) deal in geometries and features:

  • Query model: Spatial filter (bounding box, point+radius, polygon) plus attribute filter (feature type, hazard class); returns features with coordinates
  • Spatial reference: Coordinate systems (WGS84 lat/lon most commonly), bounding boxes, geometries
  • Response size: Can be very large (a road network for a country = thousands of LineString features, megabytes of GeoJSON)
  • An agent can: "Get all road segments within 50km of Mombasa port that have flood risk > 0.7" and receive a set of map features
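
For comparison, a sketch of a geospatial request against an OGC API Features endpoint; the base URL and collection name are placeholders, while bbox and limit are standard parameters (the flood-risk filter would need a CQL2 expression and is omitted here):

# Sketch: bbox query against an OGC API Features endpoint
import requests

BASE = "https://example.org/ogcapi"              # placeholder endpoint
resp = requests.get(
    f"{BASE}/collections/road-segments/items",   # placeholder collection
    params={"bbox": "39.2,-4.5,40.1,-3.6",       # roughly the Mombasa area
            "limit": 100},                        # never request unbounded results
    headers={"Accept": "application/geo+json"},
    timeout=60,
)
features = resp.json()["features"]
print(len(features), "features returned")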

The Bridge: How an Agent Uses Both in One Flow

A real question almost always needs both. Consider: "Which East African countries have the worst road safety outcomes relative to their road infrastructure investment?"

An agent needs to:

  1. Tabular query: Get road safety indicators (WHO/World Bank) for East African countries
  2. Tabular query: Get transport infrastructure spending (IATI/DAC) for the same countries
  3. Geospatial query (optional but enriching): Get road network extent (OPSIS) to normalise by network size
  4. Synthesis: Combine the results, compute ratios, rank countries

Steps 1 and 2 are pure tabular. Step 3 is geospatial but the agent only needs an aggregate (total road km per country), not the full geometry. Step 4 is reasoning.

The key design insight: the agent needs to be able to move between tabular and geospatial worlds using shared reference frames — primarily geography (country codes, region names, bounding boxes) and time.

Critical Considerations

1. Geographic Reference Translation

An agent thinks in terms of "East Africa" or "Kenya" or "Mombasa." Different APIs need this expressed differently:

Source             How Geography Is Expressed
World Bank         country=KEN;TZA;UGA;ETH (ISO codes)
OECD/SDMX          REF_AREA=KEN+TZA+UGA+ETH
OPSIS              bbox=28.8,-11.7,51.4,5.0 (bounding box)
PortWatch ArcGIS   geometry={"rings":...}&geometryType=esriGeometryPolygon
OGC API Features   bbox=28.8,-11.7,51.4,5.0 or CQL2 spatial filter
Overture Maps      Parquet partition by S2 cell or DuckDB spatial filter

The service layer must provide geographic reference translation: agent says "East Africa", the layer knows this means ISO codes KEN, TZA, UGA, ETH, RWA, BDI, SSD, SOM, DJI, ERI for tabular APIs and bounding box [28.8, -11.7, 51.4, 5.0] for spatial APIs. This is a lookup table plus boundary geometries — straightforward to build, essential for agent usability.
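
A minimal sketch of that lookup, reusing the codes and bounding box from above:

# Sketch: geographic reference lookup shared by tabular and spatial APIs
REGIONS = {
    "East Africa": {
        "iso3": ["KEN", "TZA", "UGA", "ETH", "RWA",
                 "BDI", "SSD", "SOM", "DJI", "ERI"],
        "bbox": [28.8, -11.7, 51.4, 5.0],        # [min_lon, min_lat, max_lon, max_lat]
    },
}

def to_worldbank(region: str) -> str:
    """Render a region as the semicolon-separated ISO list the World Bank expects."""
    return ";".join(REGIONS[region]["iso3"])

def to_bbox(region: str) -> list[float]:
    """Render a region as a WGS84 bounding box for spatial APIs."""
    return REGIONS[region]["bbox"]

print(to_worldbank("East Africa"))               # KEN;TZA;UGA;...
print(to_bbox("East Africa"))                    # [28.8, -11.7, 51.4, 5.0]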

2. Response Size Management

Tabular APIs return kilobytes. Geospatial APIs can return megabytes or gigabytes. An agent's context window cannot hold the full road network of Kenya.

Strategies:

  • Aggregation at the source: The thin wrapper can aggregate geospatial results before returning them. "Total road km by region" instead of every road segment.
  • Summary + detail pattern: Return a summary first (counts, totals, statistics) and let the agent request detail only if needed (sketched below).
  • Spatial simplification: Reduce geometry precision for agent consumption (an agent does not need 15-decimal-point coordinates).
  • Feature count limits with pagination: Never return unbounded result sets.
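
A sketch of the summary + detail pattern from the list above; it assumes each feature carries a precomputed length_km property, which is an assumption about the source:

# Sketch: summary + detail pattern for geospatial responses
def summarise_roads(features: list[dict]) -> dict:
    """Aggregate road features into a context-window-friendly summary."""
    total_km = sum(f["properties"].get("length_km", 0.0) for f in features)
    by_class: dict[str, int] = {}
    for f in features:
        road_class = f["properties"].get("class", "unknown")
        by_class[road_class] = by_class.get(road_class, 0) + 1
    return {"feature_count": len(features),
            "total_length_km": round(total_km, 1),
            "count_by_class": by_class}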

3. Temporal Alignment

Different sources have different temporal models:

  • World Bank indicators: annual snapshots, often 1-2 years lag
  • OPSIS: quarterly updates, point-in-time infrastructure state
  • PortWatch: near-real-time disruption events with timestamps
  • IATI: spending by fiscal year, variable reporting lag

The service layer needs temporal metadata: when was this data last updated, what period does it cover, what is the expected lag. Agents need this to know whether they are combining comparable time periods or mixing 2023 infrastructure data with 2025 spending data.
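
A sketch of how that metadata might be carried and checked; the frequencies and lags restate the list above, and the shape is illustrative:

# Sketch: temporal metadata and a comparability check (values illustrative)
TEMPORAL = {
    "world-bank": {"frequency": "annual",         "typical_lag_years": 2},
    "opsis":      {"frequency": "quarterly",      "typical_lag_years": 0},
    "portwatch":  {"frequency": "near-real-time", "typical_lag_years": 0},
    "iati":       {"frequency": "fiscal-year",    "typical_lag_years": 1},
}

def warn_if_misaligned(source_a: str, year_a: int,
                       source_b: str, year_b: int) -> None:
    """Flag combinations that mix clearly different reference periods."""
    if abs(year_a - year_b) > 1:
        print(f"warning: {source_a} ({year_a}) vs {source_b} ({year_b}) "
              f"spans {abs(year_a - year_b)} years")

warn_if_misaligned("opsis", 2023, "iati", 2025)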

4. Geospatial Query Patterns That Agents Need

Not all geospatial queries are the same. The service layer should support a small, well-defined set of spatial query patterns:

Pattern          Description                                       Example
Bounding box     Features within a rectangle                       "Road network in this map extent"
Point + radius   Features within distance of a point               "Ports within 100km of Dar es Salaam"
Region lookup    Features within a named administrative boundary   "Infrastructure in Kenya"
Corridor query   Features along a route/line                       "Climate risk along the Mombasa-Nairobi corridor"
Intersection     Features that overlap with another geometry       "Road segments that cross flood zones"

Of these, region lookup and bounding box cover 80% of agent use cases. Corridor and intersection queries are more advanced but critical for transport-specific analysis.

The agent should not need to construct WKT geometries or encode ArcGIS spatial reference objects. It should be able to say region=KEN or bbox=36.6,-1.5,37.1,-1.1 and get results.

5. Combining Results Across APIs

The trickiest part: an agent queries two sources and needs to join the results. This requires shared identifiers or spatial joining.

Shared identifiers (preferred where available):

  • ISO country codes link most tabular datasets
  • UN/LOCODE links port-related data
  • IATA/ICAO codes link aviation data
  • No universal identifier exists for roads, rail lines, or transit stops

Spatial joining (when identifiers do not align):

  • "Find the World Bank transport spending for the country that contains this OPSIS road segment"
  • This requires the agent (or the service layer) to do a point-in-polygon or feature-in-region lookup
  • The service layer should provide a spatial reference resolver: given a coordinate or geometry, return the containing country/region/district

Practical approach: The service layer provides a small set of reference resolution tools:

/resolve/country?lat=-1.3&lon=36.8   → {"iso3": "KEN", "name": "Kenya", "region": "East Africa"}
/resolve/region?name=East+Africa     → {"countries": ["KEN","TZA",...], "bbox": [28.8,-11.7,51.4,5.0]}
/resolve/port?name=Mombasa           → {"locode": "KEMBA", "lat": -4.04, "lon": 39.67, "country": "KEN"}

These become the agent's spatial vocabulary — simple tools it calls to translate between human geography and machine-queryable spatial references.
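
A sketch of the resolver's core point-in-polygon lookup using shapely; the boundaries file is hypothetical:

# Sketch: point-in-polygon core of the /resolve/country tool (requires shapely)
import json
from shapely.geometry import Point, shape

with open("country-boundaries.geojson") as f:    # hypothetical boundaries file
    countries = json.load(f)["features"]

def resolve_country(lat: float, lon: float) -> str | None:
    """Return the ISO3 code of the country containing the coordinate."""
    point = Point(lon, lat)                      # shapely uses (x, y) = (lon, lat)
    for feature in countries:
        if shape(feature["geometry"]).contains(point):
            return feature["properties"]["iso3"]
    return None

print(resolve_country(-1.3, 36.8))               # expected: "KEN"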

6. Agentic Multi-Pass Orchestration

The real power emerges when agents can chain queries across sources. This is not a single API call — it is a multi-step reasoning process:

User: "Which transport corridors in Sub-Saharan Africa are most
       vulnerable to climate disruption relative to their trade importance?"

Agent reasoning:
  Pass 1 (Discovery):  Check registry → need OPSIS (infrastructure + climate risk),
                        PortWatch (trade/port data), World Bank (trade indicators)

  Pass 2 (Reference):  Resolve "Sub-Saharan Africa" → list of countries + bounding box

  Pass 3 (Parallel queries):
    - OPSIS: Get major transport corridors in SSA bbox with climate risk scores
    - PortWatch: Get trade volumes for SSA ports
    - World Bank: Get trade-to-GDP ratios for SSA countries

  Pass 4 (Synthesis):  Join corridor risk data with trade importance,
                        rank corridors by vulnerability * trade impact,
                        return top 10 with narrative explanation

  Pass 5 (Optional):   Generate a map showing the top corridors
                        colour-coded by combined risk-trade score

The service layer does not orchestrate this. The agent does. The service layer's job is to make each individual step trivial — discoverable, understandable, and callable with consistent, simple patterns.
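
To ground Pass 4, a sketch of the join-and-rank step, with illustrative records standing in for real API responses:

# Sketch: the Pass 4 join, with illustrative records in place of API responses
corridors = [
    {"name": "Mombasa-Nairobi", "country": "KEN", "climate_risk": 0.8},
    {"name": "Dar es Salaam-Dodoma", "country": "TZA", "climate_risk": 0.6},
]
trade_importance = {"KEN": 0.9, "TZA": 0.7}      # e.g. normalised trade-to-GDP

def score(c: dict) -> float:
    """Vulnerability weighted by trade impact, as in Pass 4."""
    return c["climate_risk"] * trade_importance.get(c["country"], 0.0)

for corridor in sorted(corridors, key=score, reverse=True)[:10]:
    print(f"{corridor['name']}: {score(corridor):.2f}")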


What This Is NOT

  • Not a data warehouse — we are not centralising data. We are making existing sources agent-accessible.
  • Not a chatbot — a chatbot is one possible consumer. The service layer serves any AI workflow.
  • Not a replacement for existing portals — the portals keep serving human users. The service layer sits alongside them.
  • Not a one-time data scrape — this is a live service that stays in sync with underlying sources.
  • Not a standard — we are not proposing a new data format. We are creating a translation and discovery layer that works with existing formats.

What This IS

  • An enabler — any organisation building AI tools for transport can plug in
  • A multiplier — make the data investment of 25+ existing platforms more valuable by making them composable
  • A demonstration — show what AI-ready data infrastructure looks like, as a pattern others can replicate across sectors
  • A practical deliverable — start with 5 sources, prove it works, expand

From Concept to Deliverable

Minimum Viable Service Layer

Start with 5 sources that represent the range of API types and data types:

Source       API Type      Data Type                   Wrapper Needed?
World Bank   REST/JSON     Tabular indicators          Minimal — add agent metadata
OPSIS        REST/JSON     Geospatial infrastructure   Minimal — add agent metadata
PortWatch    ArcGIS REST   Geospatial + trade events   Yes — translate ArcGIS patterns to simple REST
OECD/ITF     SDMX          Transport statistics        Yes — translate SDMX to simple REST
IATI/DAC     REST/XML      Financial flows             Moderate — simplify query model, return JSON

Plus:

  • A data registry with entries for all 25 audited sources (even if only 5 have active wrappers)
  • A geographic reference resolver for region/country/coordinate translation
  • Agent tool definitions in MCP-compatible format

What This Demonstrates

With these 5 sources wired up, an agent can answer questions that span:

  • Infrastructure condition and climate risk (geospatial)
  • Country-level transport indicators (tabular)
  • Trade and port dynamics (geospatial + temporal)
  • Transport sector statistics across OECD and beyond (tabular)
  • Aid and investment flows into transport (financial)

That is enough to show the principle: data that was never designed to be connected, made connectable through an AI service layer.