Accelerating ESG data delivery

In the rapidly evolving landscape of sustainable finance, the ability to process vast quantities of environmental, social, and governance (ESG) data with speed and precision is a critical differentiator. We are proud to present the results of our newly deployed, AI-powered data extraction pipeline—a system designed to automate the ingestion and analysis of official corporate disclosures.

Institutional investors continue to struggle with access to reliable and timely ESG disclosure data, and many act on data that is more than a year old. These delays dampen the market's responsiveness to corporate sustainability performance, whether through capital allocation, stewardship activity, or portfolio construction. AI has been seen as the solution to this problem, but ensuring high-quality data extraction from disclosure documents is a challenge that extends far beyond what AI models can do today. Matter recently rolled out the latest updates to our extraction technology, which have more than doubled our automation rates, setting a new standard for the speed at which disclosures turn into investor-ready data. In this piece, we discuss the challenges facing ESG disclosure data and present how we have integrated a fine-tuned open-source AI model into our extraction architecture to produce client-ready, high-quality ESG disclosure data at speed.

Timely ESG disclosure data is hard to come by

For more than a decade, institutional investors have integrated ESG data into their investment decisions. While some information can be sourced from external channels such as media coverage, the majority is taken from corporate disclosures. These take many forms: annual and integrated reports, dedicated sustainability reports, ESG data tables on company websites, and press releases, among others.

In recent years, the quantity, structure, and quality of this data have improved. Companies are reporting more metrics with more precision, supported by harmonised standards such as the ISSB, as well as growing regulatory requirements and disclosure mandates. However, company reporting timeliness remains a constraint: most disclosures are produced annually and often lag behind the period they cover, setting a baseline limit on how current data can be for all investors.

A compounding issue is the next step, the creation of comparable, investor-ready datasets, which introduces its own delays and challenges. This is the part of the data lag problem that service providers can solve: a necessary, if not sufficient, step towards the vision of real-time sustainability data. Extracting and structuring disclosure data at scale, at speed, and with the required quality is surprisingly difficult.

There are several reasons why:

Heterogeneous data structures: ESG data is not disclosed with the same discipline and rigidity as financial data. Key metrics like GHG emissions may be reported in tables, images, diagrams, charts, etc.

Low reporting quality: ESG data is reported with a large variance in units and formats. Some companies report aggregated figures (such as total tonnes of CO2e emitted per year), while others break them down into subcategories. Some companies use Mt to denote metric tonnes, while others use the same abbreviation for million tonnes. We also observe erroneous use of separators in large numbers and incorrect units across reporting years.

High variance in data comparability: Comparability between companies and across years requires aligned data definitions. Does the existence of a human rights policy require explicit wording around UNGP or OECD Guidelines? What type of language do we accept to prove commitment? Should wage gap numbers be unadjusted or adjusted? Should we use median or average? Slight differences in metric definitions can make standardisation difficult. Furthermore, companies often apply footnotes and limitations to their disclosures that are important for understanding the completeness and comparability of a given metric. Companies may also change methodologies from year to year, resulting in restatements of metrics from previous years.

Limited training data: Various attempts at getting large general-purpose AI models to extract data at high quality have failed. Investors expect quality rates well above 95% in ESG disclosure data, and no models can deliver that today. Adding to that, ESG disclosures are most prevalent among mid- and large-cap listed companies, and more frequent in certain markets and industries. This means that the amount of available data for training AI models is limited. Finally, it is costly to create training datasets, as accurate annotation requires subject matter expertise in the ESG domain.
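
To make the unit and separator problems above concrete, here is a minimal Python sketch of a normalisation step. The function names, the conversion table, and the rule that an ambiguous "Mt" must be resolved by an explicit hint from the caller are illustrative assumptions, not a description of Matter's production logic.

```python
import re
from typing import Optional

# Hypothetical conversion factors to tonnes. "Mt" is deliberately absent:
# it may mean metric tonnes or million tonnes and needs a hint from context.
UNIT_TO_TONNES = {
    "t": 1.0,
    "tonnes": 1.0,
    "kt": 1_000.0,
    "million tonnes": 1_000_000.0,
}


def parse_number(text: str) -> float:
    """Parse a figure that may use ',' or '.' as thousands or decimal separator."""
    cleaned = text.strip().replace(" ", "")
    # Separators that are each followed by exactly three digits are treated as
    # thousands marks. A value like "1,234" is therefore read as 1234 (assumption).
    if re.fullmatch(r"\d{1,3}([.,]\d{3})+", cleaned):
        return float(re.sub(r"[.,]", "", cleaned))
    # When both separators occur, the last one is taken as the decimal point.
    if "," in cleaned and "." in cleaned:
        decimal = max(cleaned.rfind(","), cleaned.rfind("."))
        integer_part = re.sub(r"[.,]", "", cleaned[:decimal])
        return float(f"{integer_part}.{cleaned[decimal + 1:]}")
    # A single remaining comma is a decimal comma.
    return float(cleaned.replace(",", "."))


def to_tonnes(value: str, unit: str, mt_means_million: Optional[bool] = None) -> float:
    """Convert a reported figure to tonnes, refusing to guess what 'Mt' means."""
    key = unit.strip()
    if key == "Mt":
        if mt_means_million is None:
            raise ValueError("'Mt' is ambiguous; resolve it from the report context")
        factor = 1_000_000.0 if mt_means_million else 1.0
    else:
        factor = UNIT_TO_TONNES[key.lower()]
    return parse_number(value) * factor
```

Raising an error on an unresolved "Mt" rather than picking a default reflects the point made above: a silent guess can be wrong by a factor of one million.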

So whereas identifying the EBITDA figure in an 8-K is something models learned to do many years ago, extracting ESG disclosure data is harder than it looks, and providers have historically struggled to deliver the necessary quality at scale without delays and high costs.

Below, we will review the case for timely disclosure data, present the results of our latest advances in AI and discuss the main components in our approach. Matter covers a broader range of ESG data disclosures, but for the sake of simplicity, we will focus only on GHG emissions metrics as the recurring example in the rest of this piece.

The case for timely carbon data

Investors want freshly-reported emissions data to flow through into datasets and models as fast and as accurately as possible, for several key reasons:

Accurate Transition Risk Pricing: Companies with high carbon intensities face imminent financial hits from carbon taxes, emissions trading schemes (like the EU ETS), and volatile, rising fossil fuel costs. Timely data allows investors to calculate a company’s exact "carbon liability" before it impacts the balance sheet.

Tracking Net-Zero Alignment: Many funds have strict mandates to align with the Paris Agreement or specific interim 2030 decarbonization targets, for example through Climate Transition Benchmarks (CTBs) and Paris-Aligned Benchmarks (PABs). Investors need frequent data updates to verify whether a company is actually cutting emissions according to its stated trajectory, rather than relying on empty promises.

Regulatory Compliance and Carbon Accounting: Under regulations like Europe’s CSRD and various global climate disclosure rules, institutional investors must report the financed emissions of their portfolios annually. They cannot accurately calculate their own Principal Adverse Impacts (PAIs) or portfolio carbon footprint with stale corporate data.

Identifying Green Pioneers: Rapid drops in a company’s GHG emissions can signal, for example, operational efficiencies, adoption of cheaper renewable energy, or a successful pivot to green product lines. Timely data helps investors spot these market leaders early.

While timely carbon data can make a real difference, asset managers and owners are effectively forced to manage climate risk blindly when GHG emissions disclosures are delayed. This presents a number of key risks:

The "Carbon Data Lag" Vulnerability: GHG data is notoriously lagging, often published 12 to 18 months after the actual emissions occurred. If an investor relies on a company's 2024 sustainability report to make decisions today, they might unknowingly be holding an asset that significantly increased its coal consumption or suffered a massive methane leak over the past year.

Inaccurate Portfolio Carbon Footprinting: Asset managers also risk publishing incorrect carbon footprints for their funds. If a major holding secretly spiked its Scope 1 emissions, the fund’s overall carbon intensity calculation will be artificially low, leaving the asset manager vulnerable to greenwashing accusations and regulatory audits.

Failure of Climate Benchmark Tracking: Funds mapped to Climate Transition Benchmarks (CTBs) or Paris-Aligned Benchmarks (PABs) are legally required to reduce their portfolio carbon footprint by a set percentage year-over-year (typically 7% annually). Delayed data can cause a fund to miss this decarbonization target, potentially requiring a forced, chaotic divestment of assets later on to re-balance the portfolio.

Misallocated Climate Capital: Delays prevent capital from reaching the companies that are actively innovating to reduce emissions right now. Instead, capital remains tied up in legacy polluters whose true current emissions profile has not yet been exposed to the market.
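
The benchmark-tracking and footprinting risks above can be illustrated with a small worked example in Python. It compounds the 7% annual reduction cited above into a target and compares it with a portfolio intensity computed from stale and from fresh data. The holdings, weights, intensity figures, and the simple weighted-average definition are invented for illustration only.

```python
from typing import Mapping


def trajectory_target(base_intensity: float, years_elapsed: int, annual_cut: float = 0.07) -> float:
    """Highest intensity still on track after compounding the annual reduction."""
    return base_intensity * (1.0 - annual_cut) ** years_elapsed


def portfolio_intensity(weights: Mapping[str, float], intensities: Mapping[str, float]) -> float:
    """Weighted average carbon intensity of the holdings (simplified definition)."""
    return sum(weights[name] * intensities[name] for name in weights)


# Hypothetical two-holding portfolio, three years after a base intensity of 100.
weights = {"A": 0.6, "B": 0.4}
stale = {"A": 100.0, "B": 50.0}   # last year's reported intensities
fresh = {"A": 100.0, "B": 90.0}   # holding B's latest disclosure shows a spike

target = trajectory_target(100.0, years_elapsed=3)
on_track_with_stale_data = portfolio_intensity(weights, stale) <= target
on_track_with_fresh_data = portfolio_intensity(weights, fresh) <= target
```

With these made-up numbers the fund appears on track when measured against stale data and off track once the fresh disclosure is used, which is the failure mode the list above describes.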

Due to the general delays in disclosure data, investors today unwillingly face uncertainty around the usefulness of the data they integrate into their models and policies. There is a clear case for applying AI in data extraction, and while automation of ESG data extraction is easier said than done, Matter has achieved some encouraging results that we are now pleased to share.

Reaching the 80% Threshold: Performance Results of the New AI Pipeline

Matter’s AI-powered data extraction pipeline is designed to automate the ingestion and analysis of official corporate disclosures, as well as the extraction, verification and transformation of disclosure data into client-ready metrics. The system has been under development for years and was released in 2025. While we have gradually improved the automation rate through various initiatives, nothing has had as big an effect as the upgrade of our underlying AI model from Qwen 2.5 to 3.5.

Through this deployment, we have tripled our previous automation rates, now processing more than 80% of ESG metrics entirely without human intervention, at quality levels on par with - and in many cases exceeding - existing data solutions in the market.

The Engine Behind the Efficiency

This leap in operational efficiency is driven by two primary technological pillars:

State-of-the-Art Architecture: The pipeline leverages Qwen 3.5’s 9-billion-parameter model, an open-source powerhouse optimized for complex multilingual understanding and high-fidelity contextual extraction.
Proprietary Training Data: A model is only as good as its foundation. We have built an expansive, proprietary training dataset specifically tailored to various ESG frameworks and metrics. We have annotated reported data from companies representing 99% of the global market cap, across multiple reporting years, to ensure representation as well as volume in the training data. In total, this gives us access to training data on tens of thousands of reported instances for a single metric like GHG scope 1 emissions. This specialized training allows the model to locate and extract nuanced figures that standard off-the-shelf LLMs frequently miss.

By minimizing the need for manual review, our pipeline offers the scalability required to solve the issue of lagged data, enabling the coverage of broader investment universes without a linear increase in operational costs.

Closing the Temporal Gap

One of the causes of data lags is that ESG data providers struggle with structural bottlenecks. Manual collection processes mean that critical data points often take several quarters to be compiled at scale, validated, and delivered to clients.

Bypassing these legacy bottlenecks, Matter’s new automated pipeline allows us to deliver validated ESG data as soon as we can ingest a newly published report. This paradigm shift offers institutional investors several core advantages:

Immediate Realignment: Portfolio managers can adjust allocations instantly based on fresh corporate disclosures rather than trading on outdated information. The remaining delay shrinks to the time a company itself needs to measure, estimate, consolidate and report its emissions.
Mitigated Event Risk: Sharp changes in a company’s ESG risk profile can be identified immediately, allowing for rapid risk mitigation.
Enhanced Reporting Accuracy: Client and regulatory reporting reflect the latest reality of underlying fund holdings.

Engineering Through Complexity: Technical Challenges and Strategic Solutions

Building a system capable of accurately parsing often ambiguous and heterogeneous corporate reports is an immense technical hurdle. Below, we outline the three primary challenges faced during development and the engineering frameworks we implemented to overcome them.

Challenge A: Cost-Efficient Retrieval, Extraction, and Validation

Running massive large language models over thousands of multi-page corporate documents can quickly become cost-prohibitive.

The Solution: We designed a hybrid, multi-stage pipeline. Instead of passing entire documents to the LLM, we use efficient, lightweight retrieval algorithms to isolate the exact pages and passages relevant to specific ESG metrics. A second model is then deployed only for the final extraction, and a third model validates the result. This significantly reduces compute while ensuring high quality.
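
The staged design can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the keyword list, the scoring rule, and the `extract` and `validate` callables (stand-ins for the extraction and validation models) are hypothetical and do not describe the actual retrieval algorithm or models in the pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Hypothetical keywords for one metric; a real retriever would be tuned per metric.
SCOPE1_KEYWORDS = ("scope 1", "direct emissions", "tco2e")


@dataclass
class Extraction:
    page: int
    value: str


def retrieve_pages(pages: Sequence[str], keywords: Sequence[str] = SCOPE1_KEYWORDS,
                   top_k: int = 2) -> List[int]:
    """Stage 1: cheap keyword scoring picks the few pages worth sending to a model."""
    scored = [(sum(page.lower().count(k) for k in keywords), i) for i, page in enumerate(pages)]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [index for _, index in ranked[:top_k]]


def run_pipeline(pages: Sequence[str],
                 extract: Callable[[str], Optional[str]],
                 validate: Callable[[str, str], bool]) -> Optional[Extraction]:
    """Stages 2 and 3: run the extractor only on retrieved pages, then validate."""
    for index in retrieve_pages(pages):
        value = extract(pages[index])
        if value is not None and validate(pages[index], value):
            return Extraction(page=index, value=value)
    return None  # nothing passed validation; a caller could escalate to a human
```

The saving comes from stage 1: in a report of several hundred pages, the more expensive model calls are made on at most `top_k` pages rather than on the whole document.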

Challenge B: Future-Proof Infrastructure for Model Lifecycle Management

The open-source AI ecosystem moves at a breakneck pace. Infrastructure that binds itself tightly to a single model architecture risks obsolescence within months.

The Solution: We built a highly modular, decoupled distributed infrastructure utilizing containerized environments and model-agnostic building blocks to ensure scalability, ease of maintenance, and rapid build speeds. This architecture allows our data science team to test, fine-tune, and swap out open-source models with zero disruption to the production environment. When Qwen 3.5 was integrated, this flexible framework enabled a smooth and rapid transition from our legacy models.
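
One common way to keep calling code independent of a specific model is to program against a small interface and select the implementation by label. The Python sketch below shows that idea only; the `ExtractionModel` protocol, the `ModelRegistry` class, and the labels are hypothetical names and not the actual infrastructure described above.

```python
from typing import Dict, Protocol


class ExtractionModel(Protocol):
    """Any object with a matching `extract` method can be plugged in."""

    def extract(self, passage: str, metric: str) -> str:
        ...


class ModelRegistry:
    """Maps a label to a model so that calling code never names a model directly."""

    def __init__(self) -> None:
        self._models: Dict[str, ExtractionModel] = {}
        self._active: str = ""

    def register(self, label: str, model: ExtractionModel) -> None:
        self._models[label] = model

    def activate(self, label: str) -> None:
        # Refuse to switch to a model that was never registered.
        if label not in self._models:
            raise KeyError(label)
        self._active = label

    def extract(self, passage: str, metric: str) -> str:
        return self._models[self._active].extract(passage, metric)
```

Swapping models then amounts to registering the new one and calling `activate`, while every caller of `registry.extract(...)` stays unchanged.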

Challenge C: Managing Domain-Specific Edge Cases and Unstructured Disclosures

Due to the nature of ESG data, a company might report emissions in metric tons in a footnote on page 40, while another uses a completely different unit inside a graphical chart on page 5. Pure AI models often struggle with these nuanced edge cases.

The Solution: We layered a robust, domain-specific business logic engine on top of our AI extraction layer. While the AI handles the heavy lifting of raw contextual extraction, our programmatic business logic applies strict validation rules, cross-checks units of measurement, handles corporate structure mapping, and routes highly anomalous data points to human analysts for edge-case verification.
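
A rule layer of this kind can be sketched as plain functions that attach flags to an extracted data point. The Python example below is illustrative only: the unit whitelist, the tenfold year-over-year threshold, and all names are assumptions, and the real rule set is certainly broader.

```python
from dataclasses import dataclass, field
from typing import List, Optional

KNOWN_UNITS = {"tCO2e", "ktCO2e", "MtCO2e"}  # assumed whitelist
MAX_YOY_RATIO = 10.0  # assumed: a tenfold jump often signals a unit mix-up


@dataclass
class DataPoint:
    metric: str
    value: float
    unit: str
    previous_value: Optional[float] = None  # prior-year value in the same unit
    flags: List[str] = field(default_factory=list)


def apply_rules(point: DataPoint) -> DataPoint:
    """Attach a flag for every rule the extracted data point violates."""
    if point.unit not in KNOWN_UNITS:
        point.flags.append("unknown_unit")
    if point.value < 0:
        point.flags.append("negative_value")
    if point.previous_value is not None and point.previous_value > 0 and point.value > 0:
        high = max(point.value, point.previous_value)
        low = min(point.value, point.previous_value)
        if high / low >= MAX_YOY_RATIO:
            point.flags.append("implausible_yoy_change")
    return point


def needs_human_review(point: DataPoint) -> bool:
    """Any flag routes the data point to an analyst instead of auto-release."""
    return bool(point.flags)
```

Keeping the rules deterministic and separate from the model means a flagged value can always be traced to the specific rule that fired.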

Guarding the Source of Truth: The Imperative of Human-in-the-Loop Quality Control

While reaching an 80% automation rate is a significant technological milestone, velocity means nothing without veracity. In institutional investing, trading or modeling on erroneous data can lead to financial, regulatory, and reputational repercussions. Therefore, maintaining high data integrity and demonstrable quality is the foundational pillar of our extraction platform.

The Pitfalls of Pure Automation

Relying exclusively on generative AI models for financial and ESG accounting introduces a dangerous double-edged sword:

The Coverage Penalty: When AI models encounter highly idiosyncratic reporting standards or ambiguous phrasing, their confidence scores drop. If a system relies solely on the model, it must discard these data points, drastically reducing overall data coverage for issuers with more difficult disclosures.
The Quality Penalty (Hallucinations): Conversely, if strict confidence thresholds are not enforced, models may "hallucinate", generating plausible-sounding but inaccurate numbers.

In the ESG domain, where standardization is still evolving, a fully autonomous system inevitably compromises either the breadth of its universe or the accuracy of its insights.

The Solution: A Symbiotic Human-in-the-Loop (HITL) Framework

To resolve this tension, we have engineered a hybrid architectural framework that seamlessly blends artificial intelligence with human expertise.

When our pipeline processes a corporate document, the system assigns a granular confidence score to every extracted metric. Any data point that falls below our stringent quality threshold, or triggers a business logic flag, is automatically routed to an intuitive human-in-the-loop validation platform.

Domain experts review the flagged context, verify or correct the data point, and authorize its release. This ensures that clients receive validated, high-quality data without sacrificing coverage on complex edge cases.
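
The routing rule described above reduces to a simple decision per data point. The Python sketch below assumes a threshold of 0.95 purely for illustration, since the article does not state the actual value, and the names are hypothetical.

```python
from typing import Iterable, NamedTuple

CONFIDENCE_THRESHOLD = 0.95  # hypothetical value, not taken from the article


class Scored(NamedTuple):
    metric: str
    value: float
    confidence: float
    flagged: bool = False  # set when a business logic rule fired


def route(item: Scored, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Send low-confidence or flagged data points to human validation."""
    if item.flagged or item.confidence < threshold:
        return "human_review"
    return "auto_release"


def automation_rate(items: Iterable[Scored], threshold: float = CONFIDENCE_THRESHOLD) -> float:
    """Share of data points released without human intervention."""
    batch = list(items)
    if not batch:
        return 0.0
    released = sum(1 for item in batch if route(item, threshold) == "auto_release")
    return released / len(batch)
```

Raising the threshold sends more data points to analysts and lowers the automation rate; lowering it does the opposite, at the cost of a higher risk that an error is released unchecked.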

The Data Flywheel: Turning Exceptions into Automation

The true power of this framework lies in its continuous feedback loop. The human-in-the-loop intervention is not just a reactive safety net; it is an active engine for system optimization.

The Data Flywheel Effect: Every time a human analyst resolves a complex edge case or corrects an ambiguous disclosure, that specific scenario is labeled, structured, and fed back into our proprietary training dataset.

As this dataset grows, the AI model is continuously fine-tuned on the very edge cases that previously caused it to stumble, leveraging hard negatives to significantly improve model performance. Over time, this systematically expands the model’s capabilities, driving our automation rates even higher while maintaining an uncompromising standard of data quality for our clients.
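
As a rough illustration of the feedback loop, the Python sketch below turns analyst reviews into training records and keeps the model's wrong answer next to the correction so that it can serve as a hard negative. The record layout, the field names, and the JSON Lines format are assumptions made for this example.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Review:
    passage: str                 # text the analyst looked at
    metric: str
    model_value: Optional[str]   # what the model extracted, if anything
    analyst_value: str           # value verified or corrected by the analyst


def to_training_record(review: Review) -> Dict[str, str]:
    """Build one labelled example; keep a wrong model answer as a hard negative."""
    record = {
        "passage": review.passage,
        "metric": review.metric,
        "label": review.analyst_value,
    }
    if review.model_value is not None and review.model_value != review.analyst_value:
        record["rejected"] = review.model_value
    return record


def append_reviews(dataset_lines: List[str], reviews: List[Review]) -> List[str]:
    """Return the dataset (as JSON Lines) extended with the new reviews."""
    new_lines = [json.dumps(to_training_record(r), sort_keys=True) for r in reviews]
    return dataset_lines + new_lines
```

Confirmed extractions become ordinary positive examples, while corrected ones carry both the right and the rejected value.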

The First-Mover Advantage in Carbon Strategies

Access to timely emissions data transforms an investor's capabilities in several distinct ways:

Reducing Compliance Risk: Fund managers can reduce the risk of making claims about carbon intensity that are outdated, especially around the critical SFDR reporting deadline at the end of June, which is shortly after most sustainability reports are published.
Front-Running the Market: Investors can identify carbon-efficient leaders or high-risk laggards weeks before the broader market updates its models, locking in advantageous pricing.
Index Optimization: Quantitative strategies relying on climate indices can execute rebalancing trades ahead of index tracking adjustments, reducing tracking error and transaction costs.

By reducing the time-to-market of critical ESG disclosures, our progress can help investors strengthen compliance with reporting requirements, adjust strategies to climate benchmarks ahead of competitors and act more quickly in an increasingly data-driven sustainability market.
