Data-Centric Governance


An emerging set of guidelines originating in both the public and private sectors has advanced varying perspectives on what constitutes responsible artificial intelligence (AI). These guidelines typically focus on qualitative properties of the AI system and their assurance through human application of processes, such as stakeholder consultations, that shape the requirements of the AI system and evaluate its compliance with these requirements. Following these human processes, AI systems are deployed to the real world where they operate at large scale and high speed. Human staff must then play catch-up to AI systems whose operating environment and consequent behavior are constantly changing. The result is an increasing number of AI incidents where people are unfairly impacted or even killed by AI systems that fail to conform to their nominal governance requirements when deployed. These incidents principally occur due to a gap between identifying the governance requirements of a system and applying them in the development and deployment of AI systems. In short, AI governance unmoored from the engineering practice of AI inevitably leads to unexpected, damaging, or dangerous outcomes.

Modern AI systems are inherently data-centric – they are produced from data then deployed to the real world to make decisions on input data. Given the centrality of data to AI systems, governance requirements are most comprehensively assured when they are defined and evaluated with data. Through appropriate preparation of governance datasets, it becomes possible to produce systems of “continuous assurance.” In analogy to continuous integration and deployment in software engineering, continuous assurance integrates evaluation of governance requirements at every stage of the product lifecycle by formulating requirements verification as a low-friction, repeatable algorithmic process.

Continuous verification and validation of system governance requirements.

Continuous assurance places AI systems into an operating scope coinciding with governance requirements. In the absence of data defining a system of continuous assurance, intelligent systems continue operating even when they begin violating governance requirements. An intelligent system that is not governed with data, is one that is not governed.

Formulating governance requirements in terms of data implies operational changes over the whole life cycle of a system. When governance is left to a final gate at delivery time, any violation of governance requirements presents a choice of either waiting months for an updated system or deploying the solution in an absence of a fix. Through the adoption of governance requirements as solution requirements during engineering, it becomes possible to realize the benefits of good governance (i.e., a better product) while greatly reducing the downstream compliance risk. Consequently, while we are presenting an approach to satisfying emerging governance requirements of AI systems, we are primarily concerned with how product teams can best move governance from a deployment barrier to a core system specification enabling better solutions.

In this work we begin by detailing the elements of data-centric governance before introducing the teams and data involved in its application, then we step through a series of technical problems and incidents (i.e., harm events) that are identified via the team structure and data. We close the paper with details on the computer systems involved in its application.

What is the one insight everyone should take away from this paper?

Shipping better products, faster, and with fewer risks requires embedding governance requirements throughout the product life cycle

Data-Centric Governance

Data-centric governance means operationalizing performance requirements for AI systems in the form of datasets and algorithmic evaluations to be run against them. It turns abstract requirements like “fairness” into objectively measurable phenomena. It recognizes the central role of data in AI system evaluation and the need for good data stewardship and rigorous evaluation protocols to preserve the statistical validity of evaluations against that data.

By one estimate, as many as 85 percent of AI projects fail to deliver or fall short. In our experiences in research and industry, the most common failure point has been failure to appropriately capture the full complexity of the solution’s deployment environment and engineer a solution accordingly. As governance processes are often aimed at surfacing potential deployment harms, their systematization earlier in the product lifecycle is likely to substantially reduce the failed delivery rate.

In this section, we lay out the key tasks involved in implementing data-centric governance in practice. Accomplishing these tasks requires a concerted effort from the people involved in all stages of the product lifecycle. We will have more to say about the impact of organizational structure on governance in Section sec-teams. For now, we identify four major areas of responsibility:

The people responsible for defining the system’s goals and requirements.

The people responsible for collecting and preparing the data necessary for system engineering and evaluation.

The people responsible for engineering the product.

The people responsible for ensuring that the solution is consistent with organizational, regulatory, and ethical requirements prior to and following deployment.

These teams can be staffed by people from a single organization producing a system, or their functions can be represented by external organizations (e.g., auditors can serve as a verification team).

Throughout this section, we will highlight the benefits of effective data-centric governance by contrasting the experiences of these development teams at two fictitious organizations: Governed Corporation (G-Corp), which follows data-centric governance best practices, and Naïve Corporation (N-Corp), which does not.

Operationalize system requirements

Data-centric governance is about turning abstract governance requirements into objectively and algorithmically measurable phenomena. Machine learning practitioners are already familiar with this idea. Test set error, for example, is commonly taken to be the operational definition of task success. But there are many other relevant dimensions of performance. Data-centric governance envisions that measuring the full breadth of relevant performance criteria should be as routine as measuring test set accuracy.

To make this possible, one must specify how to measure performance algorithmically, which for an AI system includes provisioning the evaluation datasets. This creates an objective, repeatable measurement process. Although emerging requirements at deployment time can necessitate collecting additional evaluation data, performance requirements, like all product requirements, should be defined as early in the development process as possible. Ill-defined requirements are a major cause of project failures in all engineering disciplines, and AI is no exception. Nevertheless, it is common for product requirements to be expressed only qualitatively in the design phase, with quantitative measures coming later after significant engineering has already been done. This is a harmful practice, as it tends to anchor quantitative requirements to what is measurable with the available data and to the numbers that the chosen engineering approach can achieve. It is difficult to be objective about what would constitute acceptable performance once expectations are anchored in this way.
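To make this concrete, the sketch below shows how a requirement such as "statistically-equivalent error rates across demographic groups" might be reduced to a repeatable, algorithmic check against a stratified evaluation dataset. The function names, record format, and gap threshold are illustrative assumptions; a production check would use a proper statistical equivalence test rather than a raw gap.

```python
# Minimal sketch: turning a qualitative fairness requirement into a
# repeatable, data-driven check. Names, data format, and threshold are
# illustrative assumptions, not a prescribed standard.
from collections import defaultdict

def error_rate_by_group(records):
    """records: iterable of (group, prediction, label) tuples."""
    errors, totals = defaultdict(int), defaultdict(int)
    for group, prediction, label in records:
        totals[group] += 1
        errors[group] += int(prediction != label)
    return {g: errors[g] / totals[g] for g in totals}

def meets_fairness_requirement(records, max_gap=0.02):
    """Requirement: per-group error rates differ by no more than max_gap."""
    rates = error_rate_by_group(records)
    return max(rates.values()) - min(rates.values()) <= max_gap

# Example usage with a toy stratified evaluation set:
evaluation_records = [
    ("group_a", 1, 1), ("group_a", 0, 1), ("group_a", 1, 1),
    ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 0),
]
print(error_rate_by_group(evaluation_records))
print(meets_fairness_requirement(evaluation_records))
```

Because the check is expressed as code run against a dataset, it can be re-executed on every system revision rather than argued about qualitatively at delivery time.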

Codifying governance requirements with data also enables practitioners to learn from past mistakes. Every AI system incident is an opportunity to identify and formalize a new relevant dimension of performance. Once formalized, it is possible to evaluate future systems for their susceptibility to similar incidents. An AI system’s scores on different batteries of data-driven evaluations rigorously characterize the boundaries of its competencies and allow teams to make informed decisions about whether to deploy the system.

Contrasting Outcomes

G-Corp. The Product Team identifies that the AI system poses a risk of disparate impacts and adds “fairness across demographic groups” as a product requirement. They define “fairness” as having statistically-equivalent error rates across relevant demographic groups. The Data Team creates engineering and evaluation datasets stratified by demographic groups. The Solution Team optimizes the system to achieve all performance goals on the engineering data. The Verification Team verifies compliance with fairness requirements using the evaluation data and confirms that the system does not produce disparate impacts – the system is safely deployed.

N-Corp. The Product Team identifies that the AI system poses a risk of disparate impacts and adds “fairness across demographic groups” as a product requirement. The Data Team focuses on engineering and evaluation data for the primary task, and the Solution Team focuses on optimizing for the primary task. The Verification Team checks that the AI system does not use sensitive features as inputs, but has no means of detecting that the model produces disparate impacts because a feature for user location, which was not considered sensitive, is correlated with sensitive attributes. The system is deployed and results in disparate impacts to real-world users.

Solve the data paradox

Data-centric governance faces a bootstrapping problem in acquiring the necessary datasets. The reliability of the system cannot be assessed without data that is representative of the actual deployment environment. The obvious way to obtain this data is to collect it from a deployed system. But, without the data, it is impossible to engineer a deployable system or to verify that a release candidate meets its minimum deployment requirements. This is the data paradox of data-driven AI systems. The path forward consists of two parts: acquiring realistic proxy data, and expanding the scope of the system iteratively with real deployment data from previous system releases.

Acquire realistic proxy data. Where real-world data is not available, suitably realistic proxy data must be acquired. There are many ways to approach this. For some applications, it may be possible to mine publicly available data collected for other purposes. For example, data for image classification is often sourced from websites like Flickr by searching for relevant keywords. Data that requires more human expertise to generate or label is often obtained from crowdsourcing services like Amazon Mechanical Turk.

Data collection strategies can be quite elaborate. For example, Google ran a telephone answering service from 2007 to 2010 primarily for the purpose of collecting data for speech-to-text transcription. The sheer amount of effort expended to collect this data should reinforce the value of large, realistic datasets for product creation. Organizations should consider carefully the scale of data required for creating the initial version of the product and the level of realism required, and should budget appropriately for the complexity of acquiring this data.

Expand scope iteratively. Once version 1.0 of a system is created and deployed, it begins to see real-world data from its actual deployment environment. This data should be collected and used to expand the datasets that will be used for creating the next version of the system. Through this bootstrapping process, the solution will gradually improve as larger amounts of real deployment data become available.

The scope of the AI system must be tightly restricted during the initial phases of this bootstrapping process. With limited deployment data available, the scope within which the system’s competencies can be verified through evaluation data is also limited, and therefore it must be prevented from operating in circumstances where the risk of harm cannot be measured. The scope of the system should begin small and expand gradually, informed by continuing evaluations of performance in the real deployment environment.

Contrasting Outcomes

G-Corp. The Product Team produces a grand vision for a device that will change the world and plans shipments to 30 countries. The Data Team finds appropriate data sourced from a single country and the Solution Team begins engineering. Knowing the data will not support shipments to 30 countries, the Go2Market strategy shifts to running pilot programs with the production device in 29 of the 30 markets. The Verification Team signs off on shipping to 1 country and the product is a huge hit – driving enrollment in the pilot program in the remaining 29 countries.

N-Corp. The Product Team produces a grand vision for a product that will change the world and justifies its multi-million dollar development budget on shipments to 30 countries. The Data Team finds appropriate data sourced from a single country and the Solution Team begins engineering. The Verification Team then is overruled when attempting to block shipments to all but the represented country. N-Corp is worried G-Corp will be first-to-market. After poor performance in 29 of the 30 markets (including several newsworthy incidents), the product is recalled globally – including in the one strong-performing market.

Steward the data

Implementing data-centric governance requires that some entity take responsibility for data stewardship. This includes collecting and storing the data, making it available for evaluations while guarding it against misuse, and performing data maintenance activities to ensure the continued construct validity of data-centric measures as the application domain changes. Data stewardship is a shared responsibility of the Data Team and the Verification Team. The Data Team decides what data is needed to measure a given construct and how to obtain it, and the Verification Team ensures that evaluations against the data are conducted properly.

Preserve diagnostic power. Exposing evaluation data to Solution Teams compromises the validity of evaluations based on that data. Even if Solution Teams exercise proper discipline in not training on the test data, something as innocent as comparing evaluation scores of multiple models can be a step down the road to overfitting. Practitioners may not be aware of all of the subtleties of statistically rigorous evaluation, and even when they are, some of the more “pedantic” requirements may be seen as unnecessary impediments to speedy delivery of a solution.

There is also the practical problem that without data access controls, it is not possible to verify that the evaluation data has not been misused. This is especially important when the evaluation is an “industry standard” benchmark, where showing state-of-the-art performance may bring prestige or financial benefits. The Verification Team is responsible for facilitating data-driven evaluations in a way that preserves the validity of the evaluation data as a diagnostic tool. The evaluation data must be kept private and the release of evaluation scores must be controlled so as not to reveal information about the data.
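A minimal sketch of what such controlled access might look like, assuming a simple classification task: the Verification Team holds the labels, accepts predictions keyed by opaque example IDs, limits the number of submissions, and reports only a coarse aggregate score. The class and parameter names are illustrative, not a real API.

```python
# Minimal sketch of a verification-side scoring service: the Solution Team
# submits predictions keyed by opaque example IDs and receives only a
# coarse aggregate score, never the evaluation inputs or labels.
class HeldOutEvaluator:
    def __init__(self, labels_by_id, round_to=2, max_submissions=5):
        self._labels = labels_by_id          # kept private to the Verification Team
        self._round_to = round_to            # coarse reporting limits information leakage
        self._remaining = max_submissions    # submission budget discourages test-set tuning

    def score(self, predictions_by_id):
        if self._remaining <= 0:
            raise RuntimeError("Submission budget exhausted for this evaluation cycle.")
        self._remaining -= 1
        correct = sum(
            int(predictions_by_id.get(example_id) == label)
            for example_id, label in self._labels.items()
        )
        return round(correct / len(self._labels), self._round_to)
```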

Contrasting Outcomes

G-Corp. The Data Team creates evaluation datasets that are collected independently of all engineering datasets. The Data Team delivers the evaluation data to the Verification Team and does not disclose the data or the method of collection to the Solution Team. The Solution Team passes their trained models off to the Verification Team, which evaluates the systems and reports the results in a way that avoids disclosing information about the evaluation data. The organization can be confident that the solution has not overfit the evaluation data, and thus that the evaluation results are reliable.

N-Corp. The Data Team delivers both the engineering data and the evaluation data to the Solution Team. The Solution Team knows the risks of overfitting the evaluation data, but under pressure to improve performance, they make changes to improve the model guided by its scores on the evaluation data. The Verification Team verifies that the system meets performance requirements and approves it for deployment. The system under-performs after deployment because architecture changes driven by repeated evaluations resulted in overfitting the evaluation data.

Maintain the data. The world changes over time, and with it changes the distribution of inputs that a deployed AI system will be asked to process. A computer vision system that tracks vehicles on the road, for example, will see new vehicle models introduced throughout its lifetime. Evaluation data must be maintained to ensure that it continues to be a valid operational measure of the associated performance requirement. This will usually require, at least, periodically collecting additional data, and possibly also pruning obsolete data or designing new data augmentations.

Data maintenance is a joint activity of the Data Team and the Verification Team. The Verification Team should conduct ongoing evaluations to look for signs of domain shift and alert the Data Team when additional evaluation data is needed. This likely requires collaboration with the Solution Team to ensure that the system emits the necessary diagnostic information after deployment. Once alerted, the Data Team should create new or modified datasets and pass them back to the Verification Team for integration into the evaluation process.
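As one illustration of such monitoring, the sketch below compares a single scalar feature's distribution in recent production traffic against the evaluation data using a two-sample Kolmogorov-Smirnov test and alerts when they diverge. The feature choice, threshold, and alerting mechanism are assumptions.

```python
# Minimal sketch of a domain-shift monitor: compare a feature's distribution
# in recent production traffic against the evaluation data and alert the
# Data Team when they diverge.
from scipy.stats import ks_2samp

def check_for_drift(evaluation_values, production_values, p_threshold=0.01):
    """Two-sample Kolmogorov-Smirnov test on one scalar feature."""
    statistic, p_value = ks_2samp(evaluation_values, production_values)
    drifted = p_value < p_threshold
    if drifted:
        # In practice this would open a ticket or page the Data Team.
        print(f"Possible domain shift: KS statistic={statistic:.3f}, p={p_value:.4f}")
    return drifted
```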

Contrasting Outcomes

G-Corp. A development team is creating an image classification app meant to run on smartphones. Whenever a new model of smartphone is released by a major manufacturer, the Data Team collects engineering and evaluation datasets comprised of images of a standard set of subjects captured with the new model of phone. Periodic system revisions include data from new phone models in their engineering and evaluation data. The app maintains high performance over time.

N-Corp. After deployment of the first version of the app, the Data Team considers their work complete and moves on to other projects. The app begins to perform poorly with the latest models of the phone. The Solution Team attempts to improve the model, but improving performance on the existing data seems to make performance with the new phones even worse. Users of newer phone models stop using the app because of its poor performance.

Evaluation Authorities. In many instances data stewardship is best performed by evaluation authorities tasked with assessing system impact requirements. Evaluation authorities standardize entire product categories with independent measurements. For example, stamped on the bottom of most AC adapters are the letters “UL Listed,” which stands for Underwriters Laboratories – an organization that has worked to test and standardize electrical safety since 1894. Without organizations like UL, electricity would be far too dangerous to embed in the walls of all our homes and businesses. People will not buy electrical systems in the absence of electrical standards. Similarly, intelligent systems are often not purchased because the purchaser has no way of efficiently and reliably determining the capacities of the system.

Contrasting Outcomes

After producing Résumé screening apps that are measurably far fairer than any human screener, G-Corp and N-Corp find they cannot sell their products to anyone because nobody trusts the performance numbers. Both firms engage third party auditors to evaluate the technology.

G-Corp. G-Corp pays the auditor a nominal premium on their normal audit price to subsequently serve as an “evaluation authority” for the market segment. As the first public standard of its kind, the entire human resources industry soon standardizes around it and G-Corp develops a performance lead from having been there first.

N-Corp. N-Corp’s solution performs just as well as G-Corp’s, but their measures are soon rendered irrelevant after other competitors standardize to the G-Corp audit. Since competitors cannot compare against N-Corp’s numbers, G-Corp wins the market.

Adopt continuous assurance

As software engineering practice has shown, the best way to ensure that requirements are met is to verify them automatically as a matter of course after every system modification and throughout the lifecycle of the system. In analogy to the continuous integration and continuous deployment (CI/CD) paradigm that has transformed the practice of software engineering, we need continuous assurance practices for AI system engineering. This section lays out the core components of continuous assurance; we discuss tooling for continuous assurance in Section sec-outline-systems.

Extend CI/CD to AI systems. The components of continuous assurance for AI systems mirror the components of CI/CD in many respects. Both CI and CD gate development and release via test suites analogous to evaluation data. To utilize the test suite, the system must be designed for testability. For an AI system, this means that the system in its deployable form must expose a generic interface enabling it to consume data from arbitrary sources and produce results. The system should also be modular with loose coupling among components so that components can be tested separately. Unfortunately, modern machine learning techniques have trended toward “end-to-end” monolithic models that are difficult to separate into components. The reason for this trend is that such models often perform somewhat better than modular architectures, but solution engineers must be aware that this performance comes at the price of testability. Recent interest in “explainable AI” is in part a reaction to this trend, acknowledging the need to understand the intermediate computational steps implicit in the model.
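A minimal sketch of designing for testability, assuming a Python-based stack: every deployable model exposes the same narrow prediction interface, so the Verification Team can run arbitrary evaluation datasets against it without knowing its internals. The names are illustrative, not a prescribed API.

```python
# Minimal sketch of a generic, testable model interface for evaluation.
from typing import Protocol, Sequence, Any

class EvaluableModel(Protocol):
    def predict(self, inputs: Sequence[Any]) -> Sequence[Any]:
        """Map a batch of raw inputs to a batch of outputs."""
        ...

def run_evaluation(model: EvaluableModel, inputs, labels, metric):
    """Evaluate any conforming model on any dataset with any metric."""
    outputs = model.predict(inputs)
    return metric(outputs, labels)
```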

In addition to testable models, we need the computational infrastructure to run the tests at scale. This is already a well-known pain point in CI/CD, and there are many companies offering solutions like cloud-based distributed build and test systems. The problem may be even more acute for AI systems due to the computational expense of running large evaluation datasets. Such problems can be overcome, but organizations developing AI systems must understand the problems and plan and budget accordingly.

Finally, just like CI/CD, continuous assurance requires an associated versioning system for AI models. Because AI models are products of training on data, the versioning system must also track versions of the training data and other parameters of the training pipeline, and record which version of the training inputs produced which version of the model.
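One possible shape for such a versioning record is sketched below, with illustrative field names; the key point is that the model, its training data version, and its training pipeline version are recorded together so any evaluation result can be traced back to its exact inputs.

```python
# Minimal sketch of a model registry entry linking a model to the data and
# pipeline versions that produced it. Fields are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelVersion:
    model_id: str
    model_hash: str             # digest of the serialized weights
    training_data_version: str  # version tag of the engineering dataset
    pipeline_commit: str        # commit of the training code
    hyperparameters: dict = field(default_factory=dict)
    evaluation_results: dict = field(default_factory=dict)  # evaluation suite -> score
```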

Contrasting Outcomes

G-Corp. The development teams implement a continuous assurance process in which deployment candidate models are checked in after training and run against the evaluation suite automatically. This allows them to notice that the new version of the model, which has more parameters, has better task performance but is less robust to noise. The Solution Team improves the model by adding regularization, and the improved model passes all evaluations.

N-Corp. The Verification Team conducts performance evaluations in the traditional way, by receiving the model from the Solution Team and manually performing ad hoc evaluations. They note its improved performance on the primary task evaluation data, but they do not run the noise robustness tests because these take extra effort and they were run for the previous model version. The model is deployed, where it performs poorly in noisy environments.

Monitor deployed systems and report incidents. While the data-centric governance practices we have discussed so far offer teams the best chance of deploying reliable, trustworthy systems, training and verification data will always be incomplete, and the world will always continue to change after the AI system is deployed. AI systems therefore require ongoing monitoring after deployment.

One objective of ongoing monitoring is to detect and report excursions from the AI system’s proper operating scope as they happen. Techniques like out-of-distribution detection should be applied to compare real-world inputs and outputs to those in the engineering and evaluation datasets. Departures from the expected distributions of data could mean that the system is operating in novel regimes where it may not be reliable. Timely detection of these departures can allow human decision-makers to place limits on the system or withdraw it from operation if its reliability cannot be guaranteed.
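A minimal sketch of such monitoring is shown below, using a simple per-feature z-score against statistics computed from the evaluation data as a stand-in for more capable out-of-distribution detectors; the threshold and feature representation are assumptions.

```python
# Minimal sketch of out-of-scope monitoring: flag production inputs whose
# summary features fall far outside the range seen in the evaluation data.
import numpy as np

class ScopeMonitor:
    def __init__(self, reference_features, z_threshold=4.0):
        reference = np.asarray(reference_features, dtype=float)
        self._mean = reference.mean(axis=0)
        self._std = reference.std(axis=0) + 1e-12
        self._z_threshold = z_threshold

    def is_out_of_scope(self, features):
        z_scores = np.abs((np.asarray(features, dtype=float) - self._mean) / self._std)
        return bool(np.any(z_scores > self._z_threshold))
```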

A second objective is to collect real-world inputs and outputs so that they can be used to augment and improve engineering and evaluation datasets. Real-world edge cases and incidents should be tracked to build edge case and incident datasets so that the system can be improved to handle these cases. Continuous assurance processes should incorporate acquired real-world data to ensure that revisions of the AI system handle new edge cases and incidents and do not regress in their handling of known cases. Accumulating real-world data also guards against domain shift by ensuring that engineering and evaluation datasets are up-to-date with the changing world.

Contrasting Outcomes

G-Corp. The Solution Team developing a wake word detection system includes an out-of-distribution (OOD) detection component in the system. During deployment, the OOD detector sends an email alert to the Verification Team indicating that the input data distribution is substantially different from the evaluation data distribution. By analyzing the real-world data collected by the system, the Verification Team determines that the engineering and evaluation datasets do not contain enough variation in speaker accents. They report this to the Data Team, who use the collected data to build more-diverse datasets, improving performance of the next system revision.

N-Corp. Without any monitoring components in place, the developer teams are unaware that the deployed system is operating with a different input distribution than the ones used for engineering and evaluation. Their first indication is when customers with certain accents begin reporting poor performance to customer service. The developers eventually realize the problem, but because data from the deployed system was not collected, the Data Team must collect more-diverse data in a lab setting, at considerable expense. Some customers who were frustrated by poor performance switch to a competitor’s product.

Organizational Factors in Data-centric Governance

While we have defined data-centric governance in terms of data and algorithms, governance processes ultimately are implemented by humans. Effective governance requires a proper separation of concerns among development teams and proper alignment of their incentives to the goals of governance. Misconfiguring these teams introduces perverse incentives into the governance of the system and renders governance efforts ineffective. In this section, we advocate for a governance structure consisting of four teams with distinct goals and areas of responsibility – the Product Team, the Data Team, the Solution Team, and the Verification Team.

The Product Team

The product team is the first team involved in producing an AI solution. Their objective is to direct the purchase or production of a system solving a specific problem by clearly defining the system’s goals and requirements. Product teams often serve as advocates for the customer’s interests when discussing requirements within an organization, which can introduce tensions between teams.

The phrase “goals and requirements” has special meaning in the AI community. Most AI systems are produced by an optimization process that repeatedly updates system configurations to better satisfy a particular performance measure. So, while product team activities determine what gets built, their decisions are also integral to how the solution will be built through optimization, since they effectively design the target metric to be optimized. Thus, adding governance requirements after the product definition reopens the entire solution engineering process.

Can you figure out the system requirements during solution engineering? In contrast to typical software development processes that increasingly plan through iteration, product definition for AI systems is intricately linked to the possibilities afforded by data that are time-consuming and expensive to collect. A failure to rigorously define the system profoundly impacts its capabilities and appropriate operating circumstances. Ideally, projects will be perfectly scoped to the “must-have” requirements; otherwise, a mis-scoped system will:

  • Over-scope. Underperform on core tasks while data and compute requirements increase
  • Under-scope. Perform poorly on unstated requirements

Tightly defining system requirements greatly reduces program risks.

The boundaries of possibility for AI systems are currently determined more by the availability of data for the task than by the capacities of current AI techniques. Thus the product team must work closely with the data team.

The Data Team

The data team is responsible for collecting and preparing the data necessary for system engineering and evaluation. Data teams are populated with subject matter experts (SMEs) and data engineers. For instance, when producing a system that identifies cancers in medical images, the SMEs are responsible for applying their expert judgment to generate metadata (e.g., drawing an outline around a cancer in an image and labeling it “carcinoma”). Data engineers build the user interfaces for the SMEs to manage metadata on the underlying data (e.g., display radiographs with labels) and maintain the library of data for use by the solution and verification teams described below.

As the datasets required for producing a solution expand, the size of the data team must also increase, often to the point where they outnumber all the other teams. Anecdotally, the most common failure point we observe in companies staffing AI engineering efforts is to place solution engineers on a problem without budgeting or staffing dataset preparation. The circumstance is then analogous to hiring a delivery driver without providing them with a vehicle: their only option is to walk the distance.

When applying data-centric governance, the data team operates in a service capacity for the product, verification, and solution teams to produce several interrelated data products. We will introduce these data products after introducing the solution and verification teams.

The Solution Team

The solution team is responsible for engineering the product. They often receive most of the public recognition when a machine learning research program makes a breakthrough, but research programs rarely produce product deployments outside of research contexts. After establishing what is possible via research, solution teams turn to making a system that can perform its task comprehensively according to the requirements adopted by the product team. Often this involves expanding the dataset requirements to cover the entirety of the system input space. Working with the “edge cases” provided by the data team occupies the vast majority of deployment solution engineering. Until edge cases are handled appropriately, it is the prerogative of the verification team to block solution deployment.

What if the solution is not known? Projects with significant uncertainties are research projects. Training requirements, edge cases, achievable performance, and operating conditions are often unknowable prior to research prototyping. Successful completion of a proof-of-concept is thus a prerequisite to formalizing governance requirements. Research reduces uncertainties allowing subsequent engineering and governance processes to be applied. We recommend contacting an institutional review board (IRB) for research program governance requirements.

Separate research programs from producing shipped intelligent systems.

The Verification Team

The verification team is responsible for ensuring that the solution is consistent with organizational, regulatory, and ethical requirements prior to and following deployment. This definition combines the remit of several teams operating in industry, including those responsible for quality assurance, test, verification, validation, compliance, and risk. As intelligent systems are increasingly subject to regulatory requirements, the Chief Compliance Officer or General Counsel office is often brought in to run compliance processes. However, as traditionally instituted, these offices are not capable of implementing a governance program without the assistance of an engineering department or outside consultants. Alternatively, firms are constituting special-purpose teams tasked with various aspects of AI assurance, such as Google’s team assessing compliance with the corporate AI principles. Such teams require cross-functional connections to be successful.

For the purpose of this position paper, we will assume the verification team either has people in-house or consults with people that know the risks of a product, including how to assess the likelihood a system will produce harms at initial deployment time and as the system and world continue to develop. From the perspective of the verification team, well-executed data-centric governance makes the final authorization to deploy an intelligent system perfunctory since all the governance processes will have been carried out prior to final verification.

What happens if you combine teams? The interests of one team will come to dominate the interests of the other team in the combination.

  • Product + Verification: The verification team is responsible for telling the product team when a solution can ship. Product teams typically want fewer limitations and are closer to revenue sources so they tend to dominate in commercial organizations.
  • Product + Data: Similarly, when product responsibilities are embedded within the data team, the data team will tend to prioritize the product team’s interests, which typically means more focus on data for the solution team and less for the verification team.
  • Product + Solution: The product team wants the best, highest performing solution possible while the solution team wants to meet requirements as quickly as possible. If the product team dominates, then the requirements of the system may be unreasonably high – which can result in missed deadlines, extreme data requirements, and more. Should the solution team come to dominate, then the product definition will tend to be scoped around what is more immediately achievable – a form of “bikeshedding”.
  • Data + Verification: The resources of the data team are not infinite. If the data and verification teams are combined, the verification team will receive rich and comprehensive measures of the system while the solution team will receive little data support for improving on those measures. By separating the data team from both the verification and solution teams, it is possible to strike a balance.
  • Data + Solution: Data used for requirements verification must not be disclosed to the solution team. When data and solution teams combine, it is difficult to know whether the integrity of the withheld datasets has been violated. High performance may be entirely illusory. More details on this problem are presented later in the paper.
  • Solution + Verification: The verification team determines when the solution team has met its requirements. If these teams are combined, there is a tendency to change requirements to match what the system is capable of.

Separate the four teams and ensure they are evaluated according to their disparate purposes.

Evaluation Authorities

The technologies for data-centric governance already exist, and they are aligned to simultaneously produce better products and more socially beneficial outcomes. What is missing at present is the organizational capacity to build and maintain the tests and requirements. While this can and should be done within the same organizations that are creating the products, there is a real concern that organizations seeking to rapidly deploy products to capture markets will exert pressure on evaluation teams to certify those products prematurely.

While such potential conflicts of interest exist in many fields, they are especially acute in data-driven AI because publicly releasing the data needed to evaluate the system destroys the data’s validity as an evaluation tool. Unlike, for example, automobile crash testing, where it would be very difficult to “cheat” a properly constructed test, in AI it is often trivial to achieve high performance on any given evaluation simply by training on the test data.

These considerations prompt us to advocate for the establishment of evaluation authorities – independent entities that perform data-driven evaluations as a service and who accept responsibility for proper stewardship of the evaluation data. Such independent evaluation benefits both product developers and consumers. Product developers are protected from self-delusion due to biases in their internal evaluations, ultimately leading to better products, and they are perhaps also protected from legal liability as the evaluation authority’s stamp of approval can provide evidence of due diligence in evaluating their products. Consumers benefit from objective standards by which they can compare products, analogous to crash safety ratings for automobiles or energy efficiency ratings for appliances.

In fact, a forerunner of the evaluation authorities and processes we envision already exists, under the umbrella of “machine learning (ML) competitions.”

What is an ML Competition?

Machine learning is a large field of research and engineering in which many organizations routinely run billions of experiments. Consequently, the field is in a statistical crisis. With every experiment comes some probability that a high-performing model got lucky instead of smart. Bringing order to the epistemological chaos is the machine learning competition, which challenges competitors to maximize some measure on private data not provided to the competing teams.

The most famous ML competition was ImageNet, for which academics were asked to produce a computer system capable of labeling image contents. In 2012 an entry into the multi-year competition vastly outperformed other entrants and produced a sea change in machine learning research. Figure fig-imagenet depicts the rapid advancements on the ImageNet task.

Figure fig-xkcd: The prevailing view of image classification prior to 2012.

By 2015, the prestige afforded to those besting the previous ImageNet leaders led a team of researchers to cheat on the competition. The research team queried the private test set across multiple sham accounts to tune their model to the private test set. As a result, the performance estimates of the competition became invalid for the model and the researchers were banned from the competition.

What can evaluation authorities tell us about system performance? Launched by Facebook in 2019, the Deepfake Detection Challenge looked to provide Facebook with tools that could detect computer-modified videos. While competitors made strong progress on the dataset Facebook filmed and then modified in-house, the teams were ranked for a $1,000,000 prize purse based on data from the platform’s real users. Even though the user test data was not produced to circumvent detection, the degradation in performance between the Facebook-produced data and the Facebook user data, as shown by Table tab-challengedata, is considerable. In effect, the competitors had produced models mostly capable of detecting when a face had been swapped, and not many other computer manipulations. Subsequent analysis also revealed that the models regularly produce false detections for people with skin diseases, such as vitiligo.

Evaluation authorities have the ability to detect when systems are not robust

Two competitor results from the Facebook Deepfake Detection Challenge. All models degraded significantly from their test set performance on Facebook generated data to test set data defined on user generated data.

Table Label: tab-challengedata


The practice of competitions serving as a form of evaluation authority has extended to the corporate world with organizations like the ML Commons. Formed as an industry collaborative, ML Commons has 59 dues-paying corporate members that fund evaluation datasets administered by independent authorities. These evaluations were historically limited to simple properties such as accuracy, throughput, and energy, but the organization is increasingly integrating the evaluation and solution engineering steps to produce better-performing systems across a wider array of performance attributes. The benchmarking and engineering needs of the commercial sector are increasingly aligning with the principles of data-centric governance and filling the need for evaluation authorities. As shown in the next section, the scope of datasets needed to service the full variety of intelligent systems now under development in industry will require a great many organizations to form evaluation authorities.

Governance and Engineering Datasets

Because AI systems are produced by and operate on data, the absence of a data-centered way of ensuring compliance is an indication that a system is not sufficiently mature to deploy outside research contexts. To illustrate, we will walk through a series of AI incidents (i.e., harm events) where an appropriate governance dataset could have prevented the incident.

Towards this, we will define two related datasets that are produced by the data team but used by the other teams for very different purposes. First we define “evaluation data,” then the “evaluation proxy.”

A dataset constructed to provide formal system performance evaluations.

Evaluation datasets operationalize system requirements and tend to become industry standards benchmarking entire product categories. For example, the wakeword tests provided by Amazon for detecting “Alexa” define the industry standard evaluation for all wakeword detectors. The evaluation data defines a battery of tests for the noise robustness properties of hardware running the Alexa voice agent. The tests are typically run in labs with real world human analogs as shown in Figure fig-hats.

Figure fig-hats: Head and Torso Simulator (HATS) with Handset Positioner. A cottage industry of these thoroughly calibrated analogues has developed for a wide variety of industrial use cases to ensure all parties can replicate the results of lab tests during an engineering and test effort. Since tests in laboratory conditions are time-intensive and expensive to run, they are typically run a limited number of times. In most instances, it is possible to collect data from these elaborate test rigs once, then evaluate system performance on the data sampled from the physical environment.

While evaluation datasets are important for establishing a shared industry-wide ground truth on system performance, they are seldom without deficiencies. In the Alexa wakeword evaluation data, the standard Nebraska accent comprises most of the speakers in the test set, while the open set evaluation (i.e., people saying words that are not “Alexa”) is a collection of radio programs largely spoken in broadcaster voices. Consequently, wakeword systems often false activate on more unusual inputs, underperform for black people, and in one incident randomly activated, recorded a voice memo, and sent it. These incidents are all related to the wakeword subsystem and are distinct from those caused by elements later in the system chain, which have included playing pornography instead of a children’s song and prompting a 10 year old to play a game involving putting a penny into an electrical socket. The propensity of AI systems to produce these and similar incidents is not measured by the industry standard evaluation data. Aspects of performance that are not directly measured are unknown, so many wakeword systems have biases that go undiscovered until academics evaluate them. Let’s step through a few examples of how to enhance evaluation data to better serve products with governance requirements.

Detect “out of scope” with scope data.

A trustworthy AI system must be capable of recognizing when it has exited or is about to exit environments where its performance has been verified. Consider an emergency stop button on a factory production line. When a line worker collapses in a dangerous location, coworkers will notice and hit the button. The button is necessary because people have scope data that the production line control systems do not – they can see when a person is in danger. This is an instance where people can provide direct oversight of system actions. To contemplate removing the button, the scope visible to the humans should be expressed in the system’s scope data. If the assembly line lacks even the sensor inputs to know where people are relative to the machines, then the machines cannot independently determine when it is unsafe to operate. Figure fig-scope gives one example where a robot’s operating scope is violated and it falls down an escalator and strikes a person.

Figure fig-scope: For a robot navigating a mall, the escalators may be out of scope for appropriate deployment. An incident in which the robot finds itself traveling down an escalator is a violation of scope that is readily identifiable to all present, but for the engineering and assurance of the system, data defining the escalator as out of bounds requires collection and structuring. With data that properly characterizes the operating scope, the system can be continuously tested for its potential to exit the scope and whether the system detects the dangerous state if an exit occurs.
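A minimal sketch of scope data in action for the mall-robot example, with made-up zone geometry: the navigation stack checks the robot's estimated position against structured out-of-bounds zones (such as escalators) before executing any motion command. The zone format and halt behavior are assumptions for illustration.

```python
# Minimal sketch: structured scope data consulted before every motion command.
OUT_OF_BOUNDS_ZONES = {
    "escalator_north": {"x_min": 12.0, "x_max": 14.5, "y_min": 3.0, "y_max": 9.0},
}

def position_in_zone(x, y, zone):
    return zone["x_min"] <= x <= zone["x_max"] and zone["y_min"] <= y <= zone["y_max"]

def within_verified_scope(x, y):
    return not any(position_in_zone(x, y, zone) for zone in OUT_OF_BOUNDS_ZONES.values())

def execute_motion(x, y, command):
    if not within_verified_scope(x, y):
        return "HALT_AND_ALERT"   # refuse to act outside the verified operating scope
    return command
```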

In another real-world incident, the Zillow Group in 2021 lost more than $80k on average for every home they purchased based on a valuation algorithm. Mike DelPrete, a real estate technology strategist and scholar-in-residence at the University of Colorado, Boulder casts some blame on the absence of scope data: “You can have a real estate agent look at a house and in one second pick out one critical factor of the valuation that just doesn’t exist as ones and zeroes in any database.” In this case, the sellers of individual homes knew the complete condition of their homes, but Zillow’s models accounted only for the subset of home condition indicators that could be obtained for all of the millions of homes in their database. Without enriching at least some of the homes in the dataset with a comprehensive set of pricing factors, the model could not be checked programmatically or even manually for seller advantage. Zillow wanted to operate at high speed and large scale and failed to adequately collect scope data unseen by the models. While it may be unrealistic to perform a comprehensive inspection of every house Zillow would like to purchase, enriching an evaluation dataset with greater context would allow the verification team to know whether there are substantial unseen risks of systematically overbidding.

Measure “edge case performance” with edge case data.

Where the scope data helps identify when a system is beyond its capacities, the edge cases define the data found just inside its supported requirements. Consider an incident where the Waze car navigation app repeatedly navigated drivers to unusually low-traffic areas in Los Angeles – ones that were on fire. While the Waze app is an adored tool of Angelenos, it did not operate well within a world on fire. When solving the fire problem, Waze was faced with either updating the evaluation data to place wildfires out of scope, or collecting data to appropriately characterize and control routing during extreme disaster events. In either case, data must be collected to characterize the operating context at its limit as shown by Figure fig-toxicity, which details an incident defined by a large collection of edge cases resulting from adversarially generated inputs.

Figure fig-toxicity: In contrast to the escalator example of Figure fig-scope, the examples given above for a language toxicity model show a collection of inputs that are necessarily in-scope for the model, but receive vastly different toxicity scores with small changes to the sentence. These are also instances of adversarial data (i.e., data produced with the express purpose of breaking the system). Adversarial data is the most common source of edge case data – several startups are developing platforms for the adversarial discovery of edge cases.

Formalize realized risks with incident data.

Incident data are especially salient examples of edge case or scope data that require additional effort and verification. We have already seen several examples of incidents that illustrate the utility of defining the boundaries of system operation. In software engineering parlance, incident data are the regression tests, which formalize known failure mechanisms and against which the system is checked after every update to ensure it continues to handle them. Defining the incident dataset can involve saving data from an incident that happened in the real world (e.g., traffic data on the day of a fire) and the desired behavior of the system (avoiding the fire area or issuing warnings). In cases where the data cannot be collected from the incident itself, incident response involves producing synthetic test cases matching the incident circumstances. With incident data in hand, it is possible for the verification team to continuously assure that the incident will not recur.
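A minimal sketch of incident data used as a regression suite, loosely modeled on the Waze wildfire routing incident discussed earlier; the incident record format and routing interface are illustrative assumptions.

```python
# Minimal sketch: each recorded incident becomes a regression case that every
# release candidate must handle before deployment.
INCIDENT_CASES = [
    {
        "incident_id": "wildfire-routing-2017",
        "inputs": {"origin": "A", "destination": "B", "hazard_zones": ["zone_7"]},
        "required_behavior": lambda route: "zone_7" not in route,
    },
]

def run_incident_regression_suite(plan_route):
    """plan_route(inputs) -> list of zones traversed; returns failing incident ids."""
    failures = []
    for case in INCIDENT_CASES:
        route = plan_route(case["inputs"])
        if not case["required_behavior"](route):
            failures.append(case["incident_id"])
    return failures
```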

Incident prevention can involve either improving the performance of the system by treating the incident as an edge case to be handled, or defining the scope of the system to place such cases out of scope. Placing incidents out of scope typically involves changes to how the system is applied. For instance, Figure fig-teenagers shows two examples of racial bias incidents in Web search queries. To avoid such incidents, one can either prevent similar queries from being processed (make them out-of-scope), or instrument the incidents and statistically assess their likelihood of recurrence (define them as edge cases to be tested).

Figure fig-teenagers: Google image search results for “Three black teenagers” versus for “Three white teenagers” show racial biases as the photos of black teenagers are predominantly mugshots. The incident data associated with this event would be the images returned exhibiting racial bias for both queries, along with additional labels for images within the search results indicating whether the images are mugshots. With the labels in hand, it is possible to codify in data the requirement that mugshots be returned with equal frequency for queries about black versus white people. While this incident can potentially be placed into the edge case data, Google chose to treat a related incident wherein black people could be labeled as gorillas as scope data by prohibiting the search and labeling of gorillas entirely.

Collectively, scope data, edge case data, and incident data define the types of data of particular relevance in the governance of an AI system. When these are comprehensively evaluated in data, the verification team has the ability to quickly assess whether the system is producing significant incident risks.

How do we know about the probability of risk?

Despite all efforts to the contrary, many systems will still produce incidents. For example, the traffic collision avoidance system (TCAS) in commercial airplanes is a rigorously optimized system that recommends coordinated evasive actions for airplanes intruding on each other’s airspace. Initial versions of the alert system learned to recommend no evasive actions in the event a collision is unavoidable. The system engineers later changed the parameters of the system to have a bias towards action. Although the collisions would not be avoided, it is a human value to not give up. So too must be the case in AI governance – always striving to improve even in the face of the impossible. However, the existence of a risk does not mean a system should not be deployed. Many planes will be saved by collision avoidance systems even if they do not prevent all collisions.

While systems like TCAS can be verified exhaustively against nearly all realistic potential circumstances, the particular challenge of most modern AI systems is that such exhaustively verifiable deployment environments are the exception. In contrast, it is usually impossible to guarantee that a machine learning-based AI system will handle all input cases, because it is impossible to enumerate all of the possible inputs. Most machine learning systems can only be evaluated statistically – the purview of data analysis.

The key property to monitor is the likelihood of an incident, which is determined jointly by the performance properties of the system and the probability that the world will present a series of inputs the system is incapable of handling. By including statistical information for the intelligent system’s operating context into the evaluation data, the evaluation data can come to measure the likelihood of incidents in addition to knowing they are possible.
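One way to make this concrete, under the simplifying assumption that inputs arrive independently and can be grouped into strata, is to estimate the incident likelihood as

$$P(\text{incident}) \;\approx\; \sum_{s \in \text{strata}} P(s \mid \text{operating context}) \cdot P(\text{failure} \mid s)$$

where the first factor is estimated from statistical information about the operating context (how often each kind of input occurs in deployment) and the second is the failure rate measured on the evaluation data for that stratum.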

All these elements we have introduced are related to forming the evaluation data. Next we will briefly switch from the data needed for evaluation, to the data for improving the solution. We define these datasets as “engineering data.”

The data used for creating and improving the end solution.

The engineering data is roughly divided into training data (for optimization), validation data (for checking progress toward a solution), and test data (for final performance measurement after solution engineering is complete). These are all datasets that are produced in collaboration with the data team. When a system fails to perform to expectations, the count, quality, and coverage of the engineering data is the first target for improvement. No amount of modeling effort can compensate for inadequate data.

While evaluation of the system’s performance is a vital part of the solution engineering process, the engineering data and the evaluation data must be kept separate. The solution team will want direct access to the evaluation data since that is how their work is ultimately measured, but using the evaluation data as a solution engineering target will inevitably destroy the evaluation’s validity, and with it the ability to know how well the system is performing. This is an instance of Goodhart’s law, which reads, “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control”, or more straightforwardly, “When a measure becomes a target, it ceases to be a good measure”. Russell and Norvig’s definitive textbook of AI succinctly describes how to avoid this hazard:

…really hold the test set out—lock it away until you are completely done with learning and simply wish to obtain an independent evaluation of the final hypothesis. (And then, if you don’t like the results … you have to obtain, and lock away, a completely new test set if you want to go back and find a better hypothesis.)

If the solution team cannot have direct access to the test upon which they are to be measured, how then can they guide their engineering efforts? Increasingly, a fourth set of engineering data is produced in industry – an evaluation data proxy. Rather than attempting an exact recreation of the evaluation data, the proxy is constructed from the data and rules specified in the system requirements. One pattern emerging in industry is to simulate or synthesize the evaluation proxy while sampling the evaluation data from the real world. Simulated and synthetic data provide many affordances to training data engineering that make them advantageous and far more nimble for iterating on solution engineering.

Can we skip making an evaluation dataset or an evaluation proxy?

If you skip making an evaluation dataset, you will not know how the system performs, and if you skip making an evaluation proxy, it is likely that the evaluation dataset will be used in the engineering process. Before deploying an intelligent system to the world, you will inevitably need both sets of data – otherwise the system will underperform, cause incidents, and have unknowable violations of governance requirements.

Make an evaluation proxy first and then independently construct the evaluation data.

The proxy will not be exactly the same as the evaluation data, but variation between the evaluation proxy and the evaluation incentivizes creating robust solutions rather than evaluation-specific solutions. Consider, for example, the efforts of users to circumvent toxicity models in Figure fig-toxicity. If the product has the requirement that it be reasonably robust to user efforts to circumvent a toxicity filter, the solution team must produce a comprehensive dataset encompassing all conceivable perturbations of toxic speech. If, however, the solution staff are given the collection of perturbations found in the evaluation set, they will be able to address those specific types of perturbations like a “checklist.” Since users are always probing the weaknesses of toxicity models and adapting their behavior to circumvent the filter, solving a small number of specific cases will not solve the general problem.
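A small sketch of what such an evaluation proxy might contain is shown below: it enumerates character-substitution perturbations of a seed phrase so the proxy covers classes of evasions rather than a fixed checklist. The substitution table and seed phrase are illustrative assumptions.

```python
import itertools

# Illustrative character substitutions users employ to evade toxicity filters.
SUBSTITUTIONS = {"a": ["a", "@", "4"], "i": ["i", "1", "!"], "o": ["o", "0"]}

def perturbations(phrase, limit=50):
    """Enumerate character-substitution variants of a phrase for the evaluation proxy."""
    options = [SUBSTITUTIONS.get(ch, [ch]) for ch in phrase.lower()]
    variants = ("".join(chars) for chars in itertools.product(*options))
    return list(itertools.islice(variants, limit))

# Each variant should still be scored as toxic by a robust filter.
print(perturbations("idiots"))
```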

Teams + Data

Teams and their interactions in the data production process. Every product begins its life with the product team defining the formal requirements for the solution in coordination with the verification team. The Data team then takes the requirements and collaborates with the Solution Team in the production of representative data. The Data Team's outputs are then either issued to the Solution Team or to the Verification Team, which subsequently takes delivery of the release candidate from the Solution Team. The verification team makes the final determination of whether the solution meets governance requirements before permitting its deployment. At the end of the process the development cycle is reentered to either improve system performance, or to expand the system scope via data now available.fig-data-pipeline


Defining scope and collecting edge cases are standard concepts in the safety engineering community, but their realization in AI systems, which are probabilistic, is distinctive. Without collecting the data according to the data engineering process of Figure fig-data-pipeline, the capacity to know what the system will do is compromised and system governance is rendered impossible. Datasets consistent with Figure fig-venn require careful construction, in which only the data team has a comprehensive view of the data space. Indeed, many large technology companies with vast stores of user data recognize these risks, and thus make the data available selectively to solution and analytics teams without fully exposing the non-aggregated data to those teams.

Relationships among the datasets discussed in this section. All data should be consistent with data that can occur within the system's deployment environment. The engineering data is available to the solution team to improve the performance of the system. The scoping data characterizes the operating envelope of the system and, thus, when the system is beyond its competencies. The edge case data defines challenging instances at the boundaries of the system's scope that require solutions. The incident data characterize instances where harms have occurred or nearly occurred as a result of the system. The incidents are not directly available to the engineering team, but they should be covered by the evaluation proxy, which is meant to mirror the true performance evaluation that encompasses all incidents and a sampling of the non-incident data space.fig-venn


Continuous Assurance for Data-Centric Governance

Data-centric governance pays dividends throughout system deployment by enabling continuous assurance. Without appropriate systematization, governance requirements are burdensome and unlikely to be adhered to over the complete system life cycle. Governed intelligent systems require continuous assurance systems to align economic and governance requirements.

Consider an incident where an Amazon recruiting tool systematically down-ranked female candidates whose resumes included the word “women’s”. Data-centric governance can prevent this disparate impact by surfacing the problem before the system is deployed. However, even presuming the system is perfect at the time of launch, it will immediately begin to degrade as job descriptions, corporate needs, and candidate experiences continue to evolve. In time, one or more protected classes would be systematically down-ranked by the recruitment tool and Amazon would be exposed to legal and regulatory risk running into the billions of dollars. Rather than continuously monitoring and patching the recruitment system, Amazon terminated the screening program. Most, if not all, intelligent system deployments face similar choices of ignoring emergent system deficiencies, developing an ongoing governance program, or terminating the deployment. The graveyard of failed AI deployments is full of projects that failed to develop assurance systems.
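A hedged sketch of the kind of pre-deployment check that could surface this problem appears below: it compares selection rates between groups and flags a violation when their ratio falls below a four-fifths-style threshold. The decision lists and the 0.8 threshold are illustrative assumptions, not Amazon’s actual data or policy.

```python
def selection_rate(decisions):
    """Fraction of candidates advanced by the screening tool."""
    return sum(decisions) / len(decisions)

def disparate_impact_ratio(decisions_group_a, decisions_group_b):
    """Ratio of selection rates between two groups (1.0 means parity)."""
    rate_a = selection_rate(decisions_group_a)
    rate_b = selection_rate(decisions_group_b)
    return min(rate_a, rate_b) / max(rate_a, rate_b)

# Illustrative screening decisions (1 = advanced to interview) by group.
women = [1, 0, 0, 0, 1, 0, 0, 0]
men   = [1, 1, 0, 1, 1, 0, 1, 0]

ratio = disparate_impact_ratio(women, men)
if ratio < 0.8:
    print(f"Governance requirement violated: selection-rate ratio {ratio:.2f} < 0.80")
```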

Can I just buy an AI system and let the vendor figure governance out?

Almost. As we have previously argued, a system that is not governed via data is not functionally governed. If the vendor has a comprehensive suite of tools for governing their deployments, then the data and dashboards they develop should be available to their customers. If they cannot provide this information, they likely don’t have these systems and you are assuming unknowable risks.

Do not buy any intelligent system without a means of continuously assessing its performance.

While there currently is no comprehensive solution providing data-centric governance as a service, there are several products and services providing elements of continuous assurance from the perspective of the solution team. These systems can often be deployed in support of data-centric governance with appropriate accommodation for the previously detailed principles.

Current Systems

While thousands of companies, consultancies, and auditors are developing tools and processes implementing governance requirements, the post-deployment governance of a system is often treated as the responsibility of the solution team rather than the verification team. Solution teams know that model performance degrades through time, so they monitor and improve the deployment in response to emerging changes. The associated “model monitoring” systems have been built with a variety of features meeting the needs of the solution team specifically.

A high-level mockup showing a user interface for evaluating the current state of intelligent system governance. Clockwise from the upper left the panels include the current state of performance across a collection of edge cases, performance across a collection of augmentations, whether the current inputs to the system conform to the distribution assumptions of the system, the number of incidents that are prevented by the currently deployed system, the performance on the training data, and the evaluation criteria summary. Each of these panels provides humans with a capacity to oversee the performance and evolution of the system.fig-mock


Data-centric governance involves additional controls and procedures on top of model monitoring systems. While a comprehensive user interface as given by Figure fig-mock does not currently exist, the core features needed to engineer it are available across a collection of open source and commercial offerings. The core features include:

  • Systems for capturing data
  • Systems for processing data
  • Visual interfaces
  • Continuous Integration/Continuous Delivery (CI/CD)

We explore each of these features in turn.

Systems for capturing data.

Computing has moved through several epochs in the distribution and maintenance of software. Initially, software could not be updated because the hardware hard-coded the software in its physical realization. Subsequently, software could be periodically updated via physical media (e.g., punch cards or discs). Finally, software transitioned to a perpetual maintenance cycle where new versions are continually released in response to security vulnerabilities or to remain feature-competitive. The next stage in software maintenance, informed by the needs of machine learning-based systems, is to include data logging and collection.

For cloud-hosted intelligent systems, capturing live data is typically a simple matter of turning on the system’s logging feature. Products that do not strictly require a constant cloud connection often ship with one regardless, for the purpose of continually improving performance. For example, Tesla vehicles produce a variety of sensor data that is uploaded to Tesla company servers. When connectivity to the cloud is not possible, many intelligent systems carry a version of the “black boxes” found in commercial aircraft. Even when such capture systems are not functional requirements of the final deployment, they had to have been produced during solution engineering in order to iteratively improve the solution. Thus, while not all deployed systems collect data from the field, the absence of such systems is often a choice driven by privacy concerns or solution cost rather than a lack of technical capacity to collect data.
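As a minimal sketch of field data capture, the wrapper below appends every input/output pair to a JSON-lines log for later governance analysis. The file destination and stand-in model are placeholders; a production deployment would typically write to a logging service instead.

```python
import json, time
from pathlib import Path

LOG_PATH = Path("inference_log.jsonl")  # placeholder destination for captured data

def logged_predict(model_predict, features):
    """Run the model and append the input/output pair to a JSON-lines log."""
    output = model_predict(features)
    record = {"timestamp": time.time(), "input": features, "output": output}
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return output

# Example with a stand-in model.
toy_model = lambda x: {"score": sum(x) / len(x)}
logged_predict(toy_model, [0.2, 0.9, 0.4])
```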

Systems for processing data.

“DataOps” is a rapidly expanding field of practice that aims to improve the quality, speed, and collaboration of data processes. Many startups have developed DataOps solutions for specific data types (e.g., images, video, autonomous driving, etc.), making it faster to apply human labels and programmatically process data (example in Figure fig-appen). After the data is collected and prepared, it can be connected to simulation environments for the intelligent system. For example, NIST’s Dioptra, as shown in Figure fig-testbed, and Seldon Core give highly scalable ways to run models. All companies producing machine learning-based systems have either installed systems like these, or produced their own in-house variants, during the solution engineering process.

An example object detection user interface for examining and applying labels to an image. From the marketing page of Appen.fig-appen


The NIST Dioptra architecture is an open source solution for running machine learning-based models against data. It combines several open source solutions to support fast and scalable inference. From the NIST Dioptra documentation.fig-testbed


Visual interfaces.

A well-implemented system will seldom need human intervention, but a well-governed one provides systems to support human analysis when governance violations are detected. For instance, in speech recognition systems, environmental noise (e.g., an unusual air conditioning system) can sometimes prevent the normal operation of the system. When these cases arise, the task of debugging is similar to describing an intermittent noise to an auto mechanic: no amount of human effort at mimicking mechanical clunking sounds will be as useful as providing an analytic user interface.

A model page for a language model as hosted by Hugging Face. The model can be run interactively on the page and deployed to the cloud by all visitors to the website since the model is open source. The left column provides documentation on the risks, limitations, and biases of the model, but no formal and comprehensive evaluation of the identified bias properties is provided in the dataset evaluation listing of the lower right. Understanding the biases of the model is left as an exercise to the developer making use of the model. Instead, flat performance properties like accuracy and F1 scores are presented and verified by Hugging Face staff. Given the model has been downloaded more than 6 million times, it is likely that the vast majority of model deployers have not engaged in any sort of formal governance program. Note: all the contents of the page are found on the Hugging Face website, but we have deleted some contents so all the elements in the screen capture will be rendered together.fig-huggingface


For one of its hosted language models (see Figure fig-huggingface), the model sharing and deployment company Hugging Face indicates that the model presents significant biases but does not formally evaluate those biases for the community. Instead, it provides a series of top-level performance properties. Model monitoring companies close the gap between data evaluation and human oversight by incorporating visual analytic user interfaces into the data logging functionality. These include Arize, WhyLabs, Grafana+Prometheus, Evidently, Qualdo, Fiddler, Amazon Sagemaker, Censius, ArthurAI, New Relic, Aporia, TruEra, Gantry, and likely others in this quickly expanding market space (see Czakon’s “Best Tools to Do ML Model Monitoring” for a rundown).

These systems are essentially data science platforms – they support a person exploring data as it is streaming in. What they don’t do without additional effort is codify requirements in such a way that they can be checked automatically and continuously. While it is possible to continually staff a data science project with personnel applying governance requirements, the value of data-centric governance is the formalization of the monitoring activity so that people do not continuously need to watch the data as it flows in.
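For example, one requirement from the mockup of Figure fig-mock – that live inputs conform to the distribution assumptions of the system – can be codified as an automated check rather than a dashboard a person must watch. The sketch below uses a two-sample Kolmogorov–Smirnov test from SciPy; the reference data, live window, and alert threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference_batch, live_batch, p_threshold=0.01):
    """Return True when live inputs no longer match the reference distribution."""
    statistic, p_value = ks_2samp(reference_batch, live_batch)
    return p_value < p_threshold

# Reference inputs captured during solution engineering vs. a live window (illustrative).
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)
live = rng.normal(0.8, 1.0, size=500)   # shifted: the world has changed

if drift_alert(reference, live):
    print("Input distribution no longer matches engineering assumptions; alert the verification team.")
```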

Continuous Integration/Continuous Delivery (CI/CD).

The final system of continuous assurance is one that wraps the governance program in software systems that continuously check for compliance with requirements. Should those requirements be violated, the system can either automatically move to a fail-safe mode (typically, shutting down) or alert humans to begin evaluating the system for potential safety and fairness issues.
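As a hedged illustration, a governance requirement can be expressed as a test that the continuous integration system runs on every release candidate, failing the pipeline when the requirement is violated. The threshold, model loader, and edge case data below are stand-ins for project-specific code.

```python
# test_governance.py -- illustrative CI gate; the pipeline runs it via `pytest`.

EDGE_CASE_ACCURACY_REQUIREMENT = 0.95  # illustrative governance requirement

def load_release_candidate():
    """Stand-in for loading the trained release candidate."""
    return lambda features: int(sum(features) > 0.5)

def load_edge_cases():
    """Stand-in for loading the labeled edge case dataset."""
    return [([0.9, 0.4], 1), ([0.1, 0.1], 0), ([0.6, 0.2], 1)]

def accuracy(model, examples):
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)

def test_edge_case_accuracy():
    """Fail the CI pipeline when the candidate violates the edge case requirement."""
    model = load_release_candidate()
    edge_cases = load_edge_cases()
    assert accuracy(model, edge_cases) >= EDGE_CASE_ACCURACY_REQUIREMENT, (
        "Release candidate violates the edge case governance requirement; do not deploy."
    )
```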

Most software today is developed with systems for continuous integration (i.e., systems that continuously test for new failures or “regressions”) and continuous delivery (i.e., systems for deploying the software into the real world). For instance, the developer operations (DevOps) platform GitLab provides the ability to integrate, test, and deploy software updates as shown in Figure fig-gitlab. Seldon Core similarly provides the systems shown in Figure fig-seldon, which support humans in deciding whether a model should be deployed after reviewing the system performance reported in testing.

Continuous integration and delivery processes as supported by the DevOps company GitLab. The process begins with a developer creating a new version of the code, which, when pushed to the GitLab server, triggers a cycle of automated testing and fixes to any failing tests. When the tests pass and are approved, the changes move to the delivery process, where additional tests are run by GitLab before the system automatically deploys to the world as the “production” version of the system. From the GitLab CI/CD documentation.fig-gitlab


Seldon Core's model-centric view of continuous integration and delivery. Like GitLab in Figure fig-gitlab, the process begins with a developer (here a data scientist) updating the implementation, which then goes into a scalable computing environment known as a “Kubernetes cluster.” The scalable computing environment then runs tests and an approver can decide whether the update goes to a staging environment for further testing and/or the live production environment. As a data-centered practice, the system has multiple versions deployed simultaneously so each can be statistically compared in their live performance.fig-seldon



The data needs, systems, and processes we have introduced may seem like a large burden, but they should be viewed in light of the benefits they provide. AI systems can operate at enormous scale and speed. With strong data-centric governance, the quality of those solutions improves, fewer AI efforts fail, and systems have a much longer useful life. Data-centric governance explicitly accounts for the hidden costs of program failures and moves uncertainties into reliable process steps.

When enacting a data-centric governance approach, the first step is to constitute or contract with teams capable of carrying out each of the functions we identified. With appropriate teams in place, it is possible to capture the insights and requirements of human auditors, stakeholders, and other governance process participants in a way that will most benefit the deployment of the system – rather than block deployment at the 11th hour.

Is it supposed to be this hard?

As a final case study, we appeal to the history of oceanic shipping, steam boilers, and electricity. Each was extremely risky in its early history and regularly led to loss of life and steep financial losses. Today shipping is very safe, steam boilers don’t regularly explode, and electricity is in every modern home with little risk of electrocution or fire. The story of these industries becoming as safe as they are today is the story of the insurance industry. Insurance companies assess risks and charge fees according to those risks. When something is more expensive to insure, you know it is also riskier than its competitors. Thus companies have an incentive to sail calm waters, design safer boilers, and standardize electrical wiring.

With a track record of anticipating emerging risks (e.g., insuring the performance of green technologies), the multinational insurance company Munich Re began offering insurance for AI systems. Scoped around insuring the performance of AI products (e.g., how well a system filters online content for moderation), the “aiSure” product requires the development of a suite of tools for monitoring system performance. In effect, Munich Re has arrived at a conclusion similar to that of data-centric governance – the operating conditions must be defined and continuously assessed. When deploying an AI system to the world, if you do not believe that Munich Re would be able to insure the system’s performance, then it is not functionally governed.

Is it supposed to be this hard? Yes! But it is worth it.

With systems of continuous assurance built into a solution from the start, governance becomes a product asset rather than a liability. We can build a more equitable and safer future together with AI.

We gratefully acknowledge the review and contributions of Andrea Brennen and Jill Crisman in the production of this work. As a position paper from and for the communities of test and evaluation, verification and validation, AI safety, machine learning, assurance systems, risk, and more, this paper would not be what it is without broad and varied input. We invite your review and feedback to improve the concepts and their communication to varied audiences.

Funding. This work was made possible by the funding of IQT Labs.


References

Mazumder, M., Banbury, C., Yao, X., et al. (2022). DataPerf: Benchmarks for Data-Centric AI Development. arXiv:2207.10062.
McGregor, S. (2022). Participation Interfaces for Human-Centered AI. arXiv:2211.08419.
Munich Re (2022). Insure AI – Guarantee the performance of your Artificial Intelligence systems. Munich Re.
McGregor, S., Paeth, K., & Lam, K. (2022). Indexing AI Risks with Incidents, Issues, and Variants. NeurIPS Workshop on Human-Centered AI. arXiv:2211.10384.
Underwriters Laboratories (2016). Engineering Progress. Selby Marketing Associates.
ML Commons (2022). MLCommons.
Appen (2022). Launch World-Class AI and ML Projects with Confidence. Appen.
GitLab (2022). CI/CD concepts. GitLab Documentation.
Hugging Face (2022). distilbert-base-uncased-finetuned-sst-2-english. Hugging Face.
Anonymous (2016). Incident 37: Female Applicants Down-Ranked by Amazon Recruiting Tool. AI Incident Database.
Colmer, D. (2018). Incident 361: Amazon Echo Mistakenly Recorded and Sent Private Conversation to Random Contact. AI Incident Database.
Lam, K. (2021). Incident 171: Traffic Camera Misread Text on Pedestrian's Shirt as License Plate, Causing UK Officials to Issue Fine to an Unrelated Person. AI Incident Database.
Anonymous (2021). Incident 160: Alexa Recommended Dangerous TikTok Challenge to Ten-Year-Old Girl. AI Incident Database.
Anonymous (2019). Incident 159: Tesla Autopilot's Lane Recognition Allegedly Vulnerable to Adversarial Attacks. AI Incident Database.
Anonymous (2021). Incident 149: Zillow Shut Down Zillow Offers Division Allegedly Due to Predictive Pricing Tool's Insufficient Accuracy. AI Incident Database.
Hall, P. (2020). Incident 134: Robot in Chinese Shopping Mall Fell off the Escalator, Knocking down Passengers. AI Incident Database.
Xie, F. (2018). Incident 114: Amazon's Rekognition Falsely Matched Members of Congress to Mugshots. AI Incident Database.
Anonymous (2020). Incident 102: Personal voice assistants struggle with black voices, new study shows. AI Incident Database.
Anonymous (2017). Incident 68: Security Robot Drowns Itself in a Fountain. AI Incident Database.
Yampolskiy, R. (2016). Incident 55: Alexa Plays Pornography Instead of Kids Song. AI Incident Database.
AIAAIC (2016). Incident 53: Biased Google Image Results. AI Incident Database.
McGregor, S. (2016). Incident 51: Security Robot Rolls Over Child in Mall. AI Incident Database.
Olsson, C. (2018). Incident 36: Picture of Woman on Side of Bus Shamed for Jaywalking. AI Incident Database.
Yampolskiy, R. (2015). Incident 34: Amazon Alexa Responding to Environmental Inputs. AI Incident Database.
Olsson, C. (2017). Incident 22: Waze Navigates Motorists into Wildfires. AI Incident Database.
Anonymous (2015). Incident 16: Images of Black People Labeled as Gorillas. AI Incident Database.
Olsson, C. (2017). Incident 13: High-Toxicity Assessed on Text Involving Women and Minority Groups. AI Incident Database.
Gartner (2018). Predicts 2019: Artificial Intelligence Core Technologies. Gartner.
The Linux Foundation (2022). Models and pre-trained weights — Torchvision 0.14 documentation.
GOOG-411 Team (2010). Goodbye to an old friend: 1-800-GOOG-411. Google Blog.
Perez, J. C. (2007). Google wants your phonemes. Infoworld.
Google (2022). How one team turned the dream of speech recognition into a reality. Google Careers Blog.
Grafana Labs (2022). Grafana: Query, visualize, alerting observability platform. Grafana Labs.
Strathern, M. (1997). ‘Improving ratings’: Audit in the British University system. European Review, 5(3), 305–321.
Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do ImageNet classifiers generalize to ImageNet? International Conference on Machine Learning (ICML), 5389–5400.
Piorkowski, D., Hind, M., & Richards, J. (2022). Quantitative AI risk assessments: Opportunities and challenges. arXiv:2209.06317.
Van Looveren, A., Klaise, J., Vacanti, G., et al. (2022). Alibi Detect: Algorithms for outlier, adversarial and drift detection.
Seldon Core (2022). Seldon Core.
National Institute of Standards and Technology (2022). What is Dioptra? — Dioptra 0.0.0 documentation.
Czakon, J. (2021). Best Tools to Do ML Model Monitoring.
Manheim, D., & Garrabrant, S. (2019). Categorizing Variants of Goodhart's Law. arXiv:1803.04585.
Russell, S., & Norvig, P. (2009). Artificial Intelligence: A Modern Approach (3rd US ed.). Prentice Hall.
Knauss, E., Damian, D., Poo-Caamaño, G., & Cleland-Huang, J. (2012). Detecting and Classifying Patterns of Requirements Clarifications. 20th IEEE International Requirements Engineering Conference.
Sayre, M. (2019). The significance of “edge cases” and the cost of imperfection as it pertains to AI adoption. Medium.
Mulvaney, E. (2021). NYC Targets Artificial Intelligence Bias in Hiring Under New Law. Bloomberg Law.
Dunnmon, J., Goodman, B., Kirechu, P., Smith, C., & Van Deusen, A. (2021). Responsible AI Guidelines in Practice. Defense Innovation Unit.
European Union (2021). The AI Act.
Mökander, J., & Floridi, L. (2022). Operationalising AI governance through ethics-based auditing: An industry case study. AI and Ethics.
Information Commissioner's Office (2022). Guidance on the AI auditing framework: Draft guidance for consultation. Information Commissioner's Office.
Office of the Director of National Intelligence (2022). Artificial Intelligence Ethics Framework for the Intelligence Community.
IBM (2022). AI Ethics.
Google (2022). Building responsible AI for everyone. Google AI.
Hosseini, H., Kannan, S., Zhang, B., & Poovendran, R. (2017). Deceiving Google's Perspective API Built for Detecting Toxic Comments. arXiv:1702.08138.
Buslaev, A., Iglovikov, V. I., Khvedchenya, E., Parinov, A., Druzhinin, M., & Kalinin, A. A. (2020). Albumentations: Fast and Flexible Image Augmentations. Information, 11(2). doi:10.3390/info11020125.
Blum, A., & Hardt, M. (2015). The ladder: A reliable leaderboard for machine learning competitions. International Conference on Machine Learning, 1006–1014.
D'Amour, A., Heller, K., Moldovan, D., et al. (2020). Underspecification presents challenges for credibility in modern machine learning. arXiv:2011.03395.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. IEEE Conference on Computer Vision and Pattern Recognition.
Tencent Keen Security Lab (2019). Experimental security research of Tesla Autopilot. Tencent.
Peng, K., Mathur, A., & Narayanan, A. (2021). Mitigating dataset harms requires stewardship: Lessons from 1000 papers. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2108.02922.
Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and machine learning.
Lee, M. S. A., & Singh, J. (2021). The landscape and gaps in open source fairness toolkits. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 1–13.
Saleiro, P., Kuester, B., Hinkson, L., et al. (2018). Aequitas: A bias and fairness audit toolkit. arXiv:1811.05577.
Bellamy, R. K. E., Dey, K., Hind, M., et al. (2019). AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 63(4/5).
 706  pages = {e1356},
 707  keywords = {fairness, fairness-aware AI, fairness-aware machine learning, interpretability, responsible AI},
 708  note = {\_eprint:},
 709  year = {2020},
 710  author = {Ntoutsi, Eirini and Fafalios, Pavlos and Gadiraju, Ujwal and Iosifidis, Vasileios and Nejdl, Wolfgang and Vidal, Maria-Esther and Ruggieri, Salvatore and Turini, Franco and Papadopoulos, Symeon and Krasanakis, Emmanouil and Kompatsiaris, Ioannis and Kinder-Kurlanda, Katharina and Wagner, Claudia and Karimi, Fariba and Fernandez, Miriam and Alani, Harith and Berendt, Bettina and Kruegel, Tina and Heinze, Christian and Broelemann, Klaus and Kasneci, Gjergji and Tiropanis, Thanassis and Staab, Steffen},
 711  journal = {WIREs Data Mining and Knowledge Discovery},
 712  urldate = {2022-10-22},
 713  number = {3},
 714  language = {en},
 715  abstract = {Artificial Intelligence (AI)-based systems are widely employed nowadays to make decisions that have far-reaching impact on individuals and society. Their decisions might affect everyone, everywhere, and anytime, entailing concerns about potential human rights issues. Therefore, it is necessary to move beyond traditional AI algorithms optimized for predictive performance and embed ethical and legal principles in their design, training, and deployment to ensure social good while still benefiting from the huge potential of the AI technology. The goal of this survey is to provide a broad multidisciplinary overview of the area of bias in AI systems, focusing on technical challenges and solutions as well as to suggest new research directions towards approaches well-grounded in a legal frame. In this survey, we focus on data-driven AI, as a large part of AI is powered nowadays by (big) data and powerful machine learning algorithms. If otherwise not specified, we use the general term bias to describe problems related to the gathering or processing of data that might result in prejudiced decisions on the bases of demographic features such as race, sex, and so forth. This article is categorized under: Commercial, Legal, and Ethical Issues {\textgreater} Fairness in Data Mining Commercial, Legal, and Ethical Issues {\textgreater} Ethical Considerations Commercial, Legal, and Ethical Issues {\textgreater} Legal Issues},
 716  doi = {10.1002/widm.1356},
 717  url = {},
 718  issn = {1942-4795},
 719  volume = {10},
 720  title = {Bias in data-driven artificial intelligence systems—{An} introductory survey},
 724  pages = {1--35},
 725  note = {Publisher: ACM New York, NY, USA},
 726  year = {2021},
 727  author = {Mehrabi, Ninareh and Morstatter, Fred and Saxena, Nripsuta and Lerman, Kristina and Galstyan, Aram},
 728  journal = {ACM Computing Surveys (CSUR)},
 729  number = {6},
 730  volume = {54},
 731  title = {A survey on bias and fairness in machine learning},
 735  pages = {1--9},
 736  year = {2021},
 737  month = {October},
 738  author = {Suresh, Harini and Guttag, John},
 739  publisher = {ACM},
 740  booktitle = {Equity and {Access} in {Algorithms}, {Mechanisms}, and {Optimization}},
 741  urldate = {2022-10-22},
 742  language = {en},
 743  doi = {10.1145/3465416.3483305},
 744  url = {},
 745  isbn = {978-1-4503-8553-4},
 746  title = {A {Framework} for {Understanding} {Sources} of {Harm} throughout the {Machine} {Learning} {Life} {Cycle}},
 747  address = {-- NY USA},
 751  pages = {95--109},
 752  keywords = {Hazard analysis, Safety analysis, Stereo vision, Test data, Testing, Validation},
 753  year = {2017},
 754  month = {December},
 755  author = {Zendel, Oliver and Murschitz, Markus and Humenberger, Martin and Herzner, Wolfgang},
 756  journal = {International Journal of Computer Vision},
 757  urldate = {2022-10-22},
 758  number = {1},
 759  language = {en},
 760  abstract = {Good test data is crucial for driving new developments in computer vision (CV), but two questions remain unanswered: which situations should be covered by the test data, and how much testing is enough to reach a conclusion? In this paper we propose a new answer to these questions using a standard procedure devised by the safety community to validate complex systems: the hazard and operability analysis (HAZOP). It is designed to systematically identify possible causes of system failure or performance loss. We introduce a generic CV model that creates the basis for the hazard analysis and—for the first time—apply an extensive HAZOP to the CV domain. The result is a publicly available checklist with more than 900 identified individual hazards. This checklist can be utilized to evaluate existing test datasets by quantifying the covered hazards. We evaluate our approach by first analyzing and annotating the popular stereo vision test datasets Middlebury and KITTI. Second, we demonstrate a clearly negative influence of the hazards in the checklist on the performance of six popular stereo matching algorithms. The presented approach is a useful tool to evaluate and improve test datasets and creates a common basis for future dataset designs.},
 761  doi = {10.1007/s11263-017-1020-z},
 762  url = {},
 763  shorttitle = {How good is my test data?},
 764  issn = {1573-1405},
 765  volume = {125},
 766  title = {How good is my test data? {Introducing} safety analysis for computer vision},
 770  keywords = {Computer Science - Computer Vision and Pattern Recognition},
 771  note = {arXiv:1807.01232 [cs]},
 772  year = {2019},
 773  month = {July},
 774  author = {Van Etten, Adam and Lindenbaum, Dave and Bacastow, Todd M.},
 775  publisher = {arXiv},
 776  urldate = {2022-10-21},
 777  abstract = {Foundational mapping remains a challenge in many parts of the world, particularly in dynamic scenarios such as natural disasters when timely updates are critical. Updating maps is currently a highly manual process requiring a large number of human labelers to either create features or rigorously validate automated outputs. We propose that the frequent revisits of earth imaging satellite constellations may accelerate existing efforts to quickly update foundational maps when combined with advanced machine learning techniques. Accordingly, the SpaceNet partners (CosmiQ Works, Radiant Solutions, and NVIDIA), released a large corpus of labeled satellite imagery on Amazon Web Services (AWS) called SpaceNet. The SpaceNet partners also launched a series of public prize competitions to encourage improvement of remote sensing machine learning algorithms. The first two of these competitions focused on automated building footprint extraction, and the most recent challenge focused on road network extraction. In this paper we discuss the SpaceNet imagery, labels, evaluation metrics, prize challenge results to date, and future plans for the SpaceNet challenge series.},
 778  url = {},
 779  shorttitle = {{SpaceNet}},
 780  title = {{SpaceNet}: {A} {Remote} {Sensing} {Dataset} and {Challenge} {Series}},
 784  year = {2022},
 785  month = {January},
 786  author = {{Andrea Brennen} and {Ryan Ashley}},
 787  journal = {In-Q-Tel},
 788  urldate = {2022-10-21},
 789  language = {en-US},
 790  abstract = {IQT Labs recently audited an open-source deep learning tool called FakeFinder that predicts whether or not a video is a […]},
 791  url = {},
 792  shorttitle = {{AI} {Assurance}},
 793  title = {{AI} {Assurance}: {What} happened when we audited a deepfake detection tool called {FakeFinder}},
 797  year = {2019},
 798  author = {Dolhansky, B. and Bitton, J. and Pflaum, B. and Lu, J. and Howes, R. and Wang, M. and Ferrer, C. Canton},
 799  journal = {arXiv e-prints},
 800  title = {The {DeepFake} {Detection} {Challenge} ({DFDC}) {Dataset}},
 804  keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society, Computer Science - Machine Learning},
 805  note = {arXiv:2102.06109 [cs]},
 806  year = {2021},
 807  month = {February},
 808  author = {Leibowicz, Claire and McGregor, Sean and Ovadya, Aviv},
 809  publisher = {arXiv},
 810  urldate = {2022-10-21},
 811  abstract = {Synthetic media detection technologies label media as either synthetic or non-synthetic and are increasingly used by journalists, web platforms, and the general public to identify misinformation and other forms of problematic content. As both well-resourced organizations and the non-technical general public generate more sophisticated synthetic media, the capacity for purveyors of problematic content to adapt induces a {\textbackslash}newterm\{detection dilemma\}: as detection practices become more accessible, they become more easily circumvented. This paper describes how a multistakeholder cohort from academia, technology platforms, media entities, and civil society organizations active in synthetic media detection and its socio-technical implications evaluates the detection dilemma. Specifically, we offer an assessment of detection contexts and adversary capacities sourced from the broader, global AI and media integrity community concerned with mitigating the spread of harmful synthetic media. A collection of personas illustrates the intersection between unsophisticated and highly-resourced sponsors of misinformation in the context of their technical capacities. This work concludes that there is no "best" approach to navigating the detector dilemma, but derives a set of implications from multistakeholder input to better inform detection process decisions and policies, in practice.},
 812  url = {},
 813  shorttitle = {The {Deepfake} {Detection} {Dilemma}},
 814  title = {The {Deepfake} {Detection} {Dilemma}: {A} {Multistakeholder} {Exploration} of {Adversarial} {Dynamics} in {Synthetic} {Media}},
 818  year = {2022},
 819  month = {October},
 820  author = {{Robert Stojnic} and {Taylor, Ross} and {Kardas, Marcin} and {Scialom, Scialom}},
 821  journal = {Papers with Code},
 822  urldate = {2022-10-21},
 823  language = {en},
 824  abstract = {The current state-of-the-art on ImageNet is CoCa (finetuned). See a full comparison of 756 papers with code.},
 825  url = {},
 826  title = {Papers with {Code} - {ImageNet} {Benchmark} ({Image} {Classification})},
 830  year = {2014},
 831  month = {September},
 832  author = {Munroe, Randall},
 833  journal = {xkcd},
 834  urldate = {2022-10-21},
 835  url = {},
 836  title = {Tasks},
 840  year = {2015},
 841  month = {June},
 842  author = {Simonite, Tom},
 843  journal = {MIT Technology Review},
 844  urldate = {2022-10-21},
 845  language = {en},
 846  abstract = {Machine learning gets its first cheating scandal.},
 847  url = {},
 848  title = {Why and {How} {Baidu} {Cheated} an {Artificial} {Intelligence} {Test}},
 852  pages = {1097--1105},
 853  note = {event-place: Lake Tahoe, Nevada},
 854  year = {2012},
 855  author = {Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E.},
 856  publisher = {Curran Associates Inc.},
 857  booktitle = {Proceedings of the 25th {International} {Conference} on {Neural} {Information} {Processing} {Systems} - {Volume} 1},
 858  abstract = {We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5\% and 17.0\% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overriding in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3\%, compared to 26.2\% achieved by the second-best entry.},
 859  title = {{ImageNet} {Classification} with {Deep} {Convolutional} {Neural} {Networks}},
 860  series = {{NIPS}'12},
 861  address = {Red Hook, NY, USA},
 865  urldate = {2022-10-21},
 866  url = {},
 867  title = {Experimental\_Security\_Research\_of\_Tesla\_Autopilot.pdf},
 871  keywords = {AI, AI Act, AI Auditing, AI legislation, Conformity Assessment, EU, Trustworthy AI},
 872  year = {2022},
 873  month = {March},
 874  author = {Floridi, Luciano and Holweg, Matthias and Taddeo, Mariarosaria and Amaya Silva, Javier and Mökander, Jakob and Wen, Yuni},
 875  urldate = {2022-10-21},
 876  language = {en},
 877  abstract = {We have developed capAI, a conformity assessment procedure for AI systems, to provide an independent, comparable, quantifiable, and accountable assessment of AI systems that conforms with the proposed AIA regulation. By building on the AIA, capAI provides organisations with practical guidance on how high-level ethics principles can be translated into verifiable criteria that help shape the design, development, deployment and use of ethical AI. The main purpose of capAI is to serve as a governance tool that ensures and demonstrates that the development and operation of an AI system are trustworthy – i.e., legally compliant, ethically sound, and technically robust – and thus conform to the AIA.},
 878  doi = {10.2139/ssrn.4064091},
 879  url = {},
 880  title = {{capAI} - {A} procedure for conducting conformity assessment of {AI} systems in line with the {EU} {Artificial} {Intelligence} {Act}},
 881  type = {{SSRN} {Scholarly} {Paper}},
 882  address = {Rochester, NY},
 886  pages = {1--8},
 887  keywords = {Computer science, Medical research, Research data, ai/medicine, ai/testing},
 888  note = {Number: 1
 889Publisher: Nature Publishing Group},
 890  year = {2022},
 891  month = {April},
 892  author = {Varoquaux, Gaël and Cheplygina, Veronika},
 893  journal = {npj Digital Medicine},
 894  urldate = {2022-10-21},
 895  number = {1},
 896  language = {en},
 897  abstract = {Research in computer analysis of medical images bears many promises to improve patients’ health. However, a number of systematic challenges are slowing down the progress of the field, from limitations of the data, such as biases, to research incentives, such as optimizing for publication. In this paper we review roadblocks to developing and assessing methods. Building our analysis on evidence from the literature and data challenges, we show that at every step, potential biases can creep in. On a positive note, we also discuss on-going efforts to counteract these problems. Finally we provide recommendations on how to further address these problems in the future.},
 898  doi = {10.1038/s41746-022-00592-y},
 899  url = {},
 900  shorttitle = {Machine learning for medical imaging},
 901  issn = {2398-6352},
 902  copyright = {2022 The Author(s)},
 903  volume = {5},
 904  title = {Machine learning for medical imaging: {Methodological} failures and recommendations for the future},
 908  pages = {8916--8925},
 909  keywords = {ai/fairness, ai/vision},
 910  year = {2020},
 911  month = {June},
 912  author = {Wang, Zeyu and Qinami, Klint and Karakozis, Ioannis Christos and Genova, Kyle and Nair, Prem and Hata, Kenji and Russakovsky, Olga},
 913  publisher = {IEEE},
 914  booktitle = {2020 {IEEE}/{CVF} {Conference} on {Computer} {Vision} and {Pattern} {Recognition} ({CVPR})},
 915  urldate = {2022-10-20},
 916  language = {en},
 917  abstract = {Computer vision models learn to perform a task by capturing relevant statistics from training data. It has been shown that models learn spurious age, gender, and race correlations when trained for seemingly unrelated tasks like activity recognition or image captioning. Various mitigation techniques have been presented to prevent models from utilizing or learning such biases. However, there has been little systematic comparison between these techniques. We design a simple but surprisingly effective visual recognition benchmark for studying bias mitigation. Using this benchmark, we provide a thorough analysis of a wide range of techniques. We highlight the shortcomings of popular adversarial training approaches for bias mitigation, propose a simple but similarly effective alternative to the inference-time Reducing Bias Amplification method of Zhao et al., and design a domain-independent training technique that outperforms all other methods. Finally, we validate our findings on the attribute classification task in the CelebA dataset, where attribute presence is known to be correlated with the gender of people in the image, and demonstrate that the proposed technique is effective at mitigating real-world gender bias.},
 918  doi = {10.1109/CVPR42600.2020.00894},
 919  url = {},
 920  shorttitle = {Towards {Fairness} in {Visual} {Recognition}},
 921  isbn = {978-1-72817-168-5},
 922  title = {Towards {Fairness} in {Visual} {Recognition}: {Effective} {Strategies} for {Bias} {Mitigation}},
 923  address = {Seattle, WA, USA},
 927  pages = {18},
 928  keywords = {ai/open-world, ai/trust},
 929  year = {2019},
 930  author = {Hendrycks, Dan and Mazeika, Mantas and Dietterich, Thomas},
 931  booktitle = {International {Conference} on {Learning} {Representations} ({ICLR})},
 932  language = {en},
 933  abstract = {It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.},
 934  title = {Deep {Anomaly} {Detection} with {Outlier} {Exposure}},
 938  pages = {4902--4912},
 939  keywords = {ai/nlp, ai/testing},
 940  year = {2020},
 941  author = {Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer},
 942  publisher = {Association for Computational Linguistics},
 943  booktitle = {Proceedings of the 58th {Annual} {Meeting} of the {Association} for {Computational} {Linguistics}},
 944  urldate = {2022-10-13},
 945  language = {en},
 946  abstract = {Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a taskagnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.},
 947  doi = {10.18653/v1/2020.acl-main.442},
 948  url = {},
 949  shorttitle = {Beyond {Accuracy}},
 950  title = {Beyond {Accuracy}: {Behavioral} {Testing} of {NLP} {Models} with {CheckList}},
 951  address = {Online},
 955  note = {arXiv:2007.07399 [cs]},
 956  year = {2020},
 957  month = {July},
 958  author = {Denton, Emily and Hanna, Alex and Amironesei, Razvan and Smart, Andrew and Nicole, Hilary and Scheuerman, Morgan Klaus},
 959  publisher = {arXiv},
 960  urldate = {2022-10-07},
 961  abstract = {In response to algorithmic unfairness embedded in sociotechnical systems, significant attention has been focused on the contents of machine learning datasets which have revealed biases towards white, cisgender, male, and Western data subjects. In contrast, comparatively less attention has been paid to the histories, values, and norms embedded in such datasets. In this work, we outline a research program - a genealogy of machine learning data - for investigating how and why these datasets have been created, what and whose values influence the choices of data to collect, the contextual and contingent conditions of their creation. We describe the ways in which benchmark datasets in machine learning operate as infrastructure and pose four research questions for these datasets. This interrogation forces us to "bring the people back in" by aiding us in understanding the labor embedded in dataset construction, and thereby presenting new avenues of contestation for other researchers encountering the data.},
 962  url = {},
 963  shorttitle = {Bringing the people back in},
 964  title = {Bringing the people back in: {Contesting} benchmark machine learning datasets},
 968  note = {arXiv:2006.07159 [cs]},
 969  year = {2020},
 970  month = {June},
 971  author = {Beyer, Lucas and Hénaff, Olivier J. and Kolesnikov, Alexander and Zhai, Xiaohua and Oord, Aäron van den},
 972  publisher = {arXiv},
 973  urldate = {2022-10-10},
 974  language = {en},
 975  abstract = {Yes, and no. We ask whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure. We therefore develop a significantly more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, we reassess the accuracy of recently proposed ImageNet classifiers, and find their gains to be substantially smaller than those reported on the original labels. Furthermore, we find the original ImageNet labels to no longer be the best predictors of this independently-collected set, indicating that their usefulness in evaluating vision models may be nearing an end. Nevertheless, we find our annotation procedure to have largely remedied the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future research in visual recognition3.},
 976  url = {},
 977  title = {Are we done with {ImageNet}?},
 981  pages = {702--703},
 982  keywords = {ai/data-augmentation},
 983  year = {2020},
 984  author = {Cubuk, Ekin D. and Zoph, Barret and Shlens, Jonathon and Le, Quoc V.},
 985  booktitle = {Proceedings of the {IEEE}/{CVF} conference on computer vision and pattern recognition workshops},
 986  shorttitle = {Randaugment},
 987  title = {Randaugment: {Practical} automated data augmentation with a reduced search space},
 991  pages = {5637--5664},
 992  keywords = {ai/datasets, ai/domain-shift},
 993  year = {2021},
 994  author = {Koh, Pang Wei and Sagawa, Shiori and Marklund, Henrik and Xie, Sang Michael and Zhang, Marvin and Balsubramani, Akshay and Hu, Weihua and Yasunaga, Michihiro and Phillips, Richard Lanas and Gao, Irena and Lee, Tony and David, Etienne and Stavness, Ian and Guo, Wei and Earnshaw, Berton A. and Haque, Imran S. and Beery, Sara and Leskovec, Jure and Kundaje, Anshul and Pierson, Emma and Levine, Sergey and Finn, Chelsea and Liang, Percy},
 995  publisher = {PMLR},
 996  booktitle = {International {Conference} on {Machine} {Learning} ({ICML})},
 997  shorttitle = {Wilds},
 998  title = {Wilds: {A} benchmark of in-the-wild distribution shifts},
1002  pages = {15458--15463},
1003  note = {Issue: 17},
1004  year = {2021},
1005  author = {McGregor, Sean},
1006  booktitle = {{AAAI} {Conference} on {Artificial} {Intelligence}},
1007  shorttitle = {Preventing repeated real world {AI} failures by cataloging incidents},
1008  volume = {35},
1009  title = {Preventing repeated real world {AI} failures by cataloging incidents: {The} {AI} {Incident} {Database}},
1013  keywords = {ai/audit, ai/trust, government},
1014  note = {Section: Technical Reports},
1015  year = {2016},
1016  month = {June},
1017  author = {David, Ruth A. and Nielsen, Paul},
1018  urldate = {2022-10-06},
1019  number = {AD1017790},
1020  language = {en},
1021  abstract = {At the request of the Under Secretary of Defense for Acquisition, Technology, and Logistics USDAT and L, the Defense Science Board DSB conducted a study on the applicability of autonomy to Department of Defense DoD missions. The study concluded that there are both substantial operational benefits and potential perils associated with the use of autonomy. Autonomy delivers significant military value, including opportunities to reduce the number of warfighters in harms way, increase the quality and speed of decisions in time-critical operations, and enable new missions that would otherwise be impossible. Autonomy is by no means new to the DoD. Fielded capabilities demonstrate ongoing progress in embedding autonomous functionality into systems, and many development programs already underway include an increasingly sophisticated use of autonomy. Autonomy also delivers significant value across a diverse array of global markets. Both enabling technologies and commercial applications are advancing rapidly in response to market opportunities. Autonomy is becoming a ubiquitous enabling capability for products spanning a spectrum from expert advisory systems to autonomous vehicles. Commercial market forces are accelerating progress, providing opportunities for DoD to leverage the investments of others, while also providing substantial capabilities to potential adversaries. This study concluded that DoD must accelerate its exploitation of autonomy both to realize the potential military value and to remain ahead of adversaries who also will exploit its operational benefits.},
1022  url = {},
1023  title = {Defense {Science} {Board} summer study on autonomy},
1027  pages = {3967--3972},
1028  keywords = {ai/trust, government},
1029  year = {2019},
1030  month = {February},
1031  journal = {Federal Register},
1032  number = {31},
1033  url = {},
1034  volume = {84},
1035  title = {Executive {Order} 13859. {Maintaining} {American} leadership in artificial intelligence.},
1039  keywords = {Computer Science - Machine Learning, Statistics - Machine Learning},
1040  note = {arXiv:1804.05862 [cs, stat]},
1041  year = {2019},
1042  month = {February},
1043  author = {Zhou, Wenda and Veitch, Victor and Austern, Morgane and Adams, Ryan P. and Orbanz, Peter},
1044  publisher = {arXiv},
1045  booktitle = {International {Conference} on {Learning} {Representations} ({ICLR})},
1046  urldate = {2022-10-10},
1047  language = {en},
1048  abstract = {Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be “compressed” to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size that, combined with off-theshelf compression algorithms, leads to state-of-the-art generalization guarantees. In particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. Additionally, we show that compressibility of models that tend to overfit is limited. Empirical results show that an increase in overfitting increases the number of bits required to describe a trained network.},
1049  url = {},
1050  shorttitle = {Non-vacuous generalization bounds at the imagenet scale},
1051  title = {Non-vacuous generalization bounds at the {ImageNet} scale: {A} {PAC}-{Bayesian} compression approach},
1055  pages = {10},
1056  year = {2018},
1057  author = {Zhao, Shengjia and Ren, Hongyu and Yuan, Arianna and Song, Jiaming and Goodman, Noah and Ermon, Stefano},
1058  booktitle = {Advances in {Neural} {Information} {Processing} {Systems} ({NeurIPS})},
1059  language = {en},
1060  abstract = {In high dimensional settings, density estimation algorithms rely crucially on their inductive bias. Despite recent empirical success, the inductive bias of deep generative models is not well understood. In this paper we propose a framework to systematically investigate bias and generalization in deep generative models of images. Inspired by experimental methods from cognitive psychology, we probe each learning algorithm with carefully designed training datasets to characterize when and how existing models generate novel attributes and their combinations. We identify similarities to human psychology and verify that these patterns are consistent across commonly used models and architectures.},
1061  title = {Bias and generalization in deep generative models: {An} empirical study},
1065  pages = {11},
1066  year = {2020},
1067  author = {Tsipras, Dimitris and Santurkar, Shibani and Engstrom, Logan and Ilyas, Andrew and Ma, Aleksander},
1068  booktitle = {International {Conference} on {Machine} {Learning} ({ICML})},
1069  language = {en},
1070  abstract = {Building rich machine learning datasets in a scal-
1071able manner often necessitates a crowd-sourced
1072data collection pipeline. In this work, we use hu-
1073man studies to investigate the consequences of em-
1074ploying such a pipeline, focusing on the popular
1075ImageNet dataset. We study how specific design
1076choices in the ImageNet creation process impact
1077the fidelity of the resulting dataset—including the
1078introduction of biases that state-of-the-art models
1079exploit. Our analysis pinpoints how a noisy data
1080collection pipeline can lead to a systematic mis-
1081alignment between the resulting benchmark and
1082the real-world task it serves as a proxy for. Finally,
1083our findings emphasize the need to augment our
1084current model training and evaluation toolkit to
1085take such misalignments into account.},
1086  title = {From {ImageNet} to image classification: {Contextualizing} progress on benchmarks},
1090  keywords = {Computer Science - Machine Learning, Statistics - Machine Learning},
1091  note = {arXiv:1812.05159 [cs, stat]},
1092  year = {2019},
1093  author = {Toneva, Mariya and Sordoni, Alessandro and Combes, Remi Tachet des and Trischler, Adam and Bengio, Yoshua and Gordon, Geoffrey J.},
1094  publisher = {arXiv},
1095  booktitle = {International {Conference} on {Learning} {Representations} ({ICLR})},
1096  urldate = {2022-10-10},
1097  language = {en},
1098  abstract = {Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a “forgetting event” to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set’s (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.},
1099  url = {},
1100  title = {An empirical study of example forgetting during deep neural network learning},
1104  pages = {11},
1105  year = {2020},
1106  author = {Shankar, Vaishaal and Roelofs, Rebecca and Mania, Horia and Fang, Alex and Recht, Benjamin and Schmidt, Ludwig},
1107  booktitle = {International {Conference} on {Machine} {Learning} ({ICML})},
1108  language = {en},
1109  abstract = {We evaluate a wide range of ImageNet models with five trained human labelers. In our year-long experiment, trained humans first annotated 40,000 images from the ImageNet and ImageNetV2 test sets with multi-class labels to enable a semantically coherent evaluation. Then we measured the classification accuracy of the five trained humans on the full task with 1,000 classes. Only the latest models from 2020 are on par with our best human labeler, and human accuracy on the 590 object classes is still 4\% and 11\% higher than the best model on ImageNet and ImageNetV2, respectively. Moreover, humans achieve the same accuracy on ImageNet and ImageNetV2, while all models see a consistent accuracy drop. Overall, our results show that there is still substantial room for improvement on ImageNet and direct accuracy comparisons between humans and machines may overstate machine performance.},
1110  title = {Evaluating machine accuracy on {ImageNet}},
1114  pages = {41},
1115  keywords = {ai/explainability, ai/trust, government},
1116  author = {Sayler, Kelley M},
1117  number = {R45178},
1118  language = {en},
1119  abstract = {Artificial intelligence (AI) is a rapidly growing field of technology with potentially significant implications for national security. As such, the U.S. Department of Defense (DOD) and other nations are developing AI applications for a range of military functions. AI research is underway in the fields of intelligence collection and analysis, logistics, cyber operations, information operations, command and control, and in a variety of semiautonomous and autonomous vehicles. Already, AI has been incorporated into military operations in Iraq and Syria. Congressional action has the potential to shape the technology’s development further, with budgetary and legislative decisions influencing the growth of military applications as well as the pace of their adoption.},
1120  url = {},
1121  title = {Artificial intelligence and national security},
1125  pages = {13},
1126  year = {2021},
1127  author = {Minderer, Matthias and Djolonga, Josip and Romijnders, Rob and Hubis, Frances and Zhai, Xiaohua and Houlsby, Neil and Tran, Dustin and Lucic, Mario},
1128  booktitle = {Advances in {Neural} {Information} {Processing} {Systems} ({NeurIPS})},
1129  language = {en},
1130  abstract = {Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.},
1131  url = {},
1132  title = {Revisiting the calibration of modern neural networks},
1136  pages = {176--187},
1137  note = {ISSN: 0302-9743, 1611-3349
1138Series Title: Lecture Notes in Computer Science},
1139  doi = {10.1007/978-3-540-74976-9_19},
1140  year = {2007},
1141  editor = {Kok, Joost N. and Koronacki, Jacek and Lopez de Mantaras, Ramon and Matwin, Stan and Mladenič, Dunja and Skowron, Andrzej},
1142  author = {Kowalczyk, Adam},
1143  publisher = {Springer Berlin Heidelberg},
1144  booktitle = {Knowledge {Discovery} in {Databases} ({KDD})},
1145  urldate = {2022-10-10},
1146  language = {en},
1147  abstract = {We demonstrate a binary classification problem in which standard supervised learning algorithms such as linear and kernel SVM, naive Bayes, ridge regression, k-nearest neighbors, shrunken centroid, multilayer perceptron and decision trees perform in an unusual way. On certain data sets they classify a randomly sampled training subset nearly perfectly, but systematically perform worse than random guessing on cases unseen in training. We demonstrate this phenomenon in classification of a natural data set of cancer genomics microarrays using crossvalidation test. Additionally, we generate a range of synthetic datasets, the outcomes of 0-sum games, for which we analyse this phenomenon in the i.i.d. setting.},
1148  url = {},
1149  isbn = {978-3-540-74975-2 978-3-540-74976-9},
1150  volume = {4702},
1151  title = {Classification of anti-learnable biological and synthetic data},
1152  address = {Berlin, Heidelberg},
1156  pages = {9},
1157  year = {2015},
1158  author = {Hardt, Moritz and Blum, Avrim},
1159  booktitle = {International {Conference} on {Machine} {Learning} ({ICML})},
1160  language = {en},
1161  abstract = {The organizer of a machine learning competi-
1162tion faces the problem of maintaining an accurate
1163leaderboard that faithfully represents the quality
1164of the best submission of each competing team.
1165What makes this estimation problem particularly
1166challenging is its sequential and adaptive nature.
1167As participants are allowed to repeatedly evaluate
1168their submissions on the leaderboard, they may
1169begin to overfit to the holdout data that supports
1170the leaderboard. Few theoretical results give ac-
1171tionable advice on how to design a reliable leader-
1172board. Existing approaches therefore often resort
1173to poorly understood heuristics such as limiting
1174the bit precision of answers and the rate of re-
1176In this work, we introduce a notion of leader-
1177board accuracy tailored to the format of a com-
1178petition. We introduce a natural algorithm called
1179the Ladder and demonstrate that it simultaneously
1180supports strong theoretical guarantees in a fully
1181adaptive model of estimation, withstands practical
1182adversarial attacks, and achieves high utility on
1183real submission files from an actual competition
1184hosted by Kaggle.
1185Notably, we are able to sidestep a powerful recent
1186hardness result for adaptive risk estimation that
1187rules out algorithms such as ours under a seem-
1188ingly very similar notion of accuracy. On a practi-
1189cal note, we provide a completely parameter-free
1190variant of our algorithm that can be deployed in a
1191real competition with no tuning required whatso-
1193  title = {The {Ladder}: a reliable leaderboard for machine learning competitions},
1197  pages = {10},
1198  year = {2017},
1199  author = {Guo, Chuan and Pleiss, Geoff and Sun, Yu and Weinberger, Kilian Q},
1200  booktitle = {International {Conference} on {Machine} {Learning} ({ICML})},
1201  language = {en},
1202  abstract = {Confidence calibration – the problem of predicting probability estimates representative of the true correctness likelihood – is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-ofthe-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling – a singleparameter variant of Platt Scaling – is surprisingly effective at calibrating predictions.},
1203  title = {On calibration of modern neural networks},
1207  pages = {11},
1208  year = {2020},
1209  author = {Engstrom, Logan and Ilyas, Andrew and Santurkar, Shibani and Tsipras, Dimitris and Steinhardt, Jacob and Madry, Aleksander},
1210  booktitle = {International {Conference} on {Machine} {Learning} ({ICML})},
1211  language = {en},
1212  abstract = {Dataset replication is a useful tool for assessing whether improvements in test accuracy on a specific benchmark correspond to improvements in models’ ability to generalize reliably. In this work, we present unintuitive yet significant ways in which standard approaches to dataset replication introduce statistical bias, skewing the resulting observations. We study ImageNet-v2, a replication of the ImageNet dataset on which models exhibit a significant (11-14\%) drop in accuracy, even after controlling for selection frequency, a human-in-the-loop measure of data quality. We show that after remeasuring selection frequencies and correcting for statistical bias, only an estimated 3.6\%±1.5\% of the original 11.7\%±1.0\% accuracy drop remains unaccounted for. We conclude with concrete recommendations for recognizing and avoiding bias in dataset replication. Code for our study is publicly available1.},
1213  title = {Identifying statistical bias in dataset replication},
1217  pages = {9},
1218  year = {2015},
1219  author = {Dwork, Cynthia and Feldman, Vitaly and Hardt, Moritz and Pitassi, Toni and Reingold, Omer and Roth, Aaron},
1220  booktitle = {Advances in {Neural} {Information} {Processing} {Systems} ({NeurIPS})},
1221  language = {en},
1222  abstract = {Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in [7], where we focused on the problem of estimating expectations of adaptively chosen functions.},
1223  title = {Generalization in adaptive data analysis and holdout reuse},
1227  pages = {1563--1572},
1228  year = {2016},
1229  month = {June},
1230  author = {Bendale, Abhijit and Boult, Terrance E.},
1231  publisher = {IEEE},
1232  booktitle = {{IEEE} {Conference} on {Computer} {Vision} and {Pattern} {Recognition} ({CVPR})},
1233  urldate = {2022-10-11},
1234  language = {en},
1235  abstract = {Deep networks have produced significant gains for various visual recognition problems, leading to high impact academic and commercial applications. Recent work in deep networks highlighted that it is easy to generate images that humans would never classify as a particular object class, yet networks classify such images high confidence as that given class – deep network are easily fooled with images humans do not consider meaningful. The closed set nature of deep networks forces them to choose from one of the known classes leading to such artifacts. Recognition in the real world is open set, i.e. the recognition system should reject unknown/unseen classes at test time. We present a methodology to adapt deep networks for open set recognition, by introducing a new model layer, OpenMax, which estimates the probability of an input being from an unknown class. A key element of estimating the unknown probability is adapting Meta-Recognition concepts to the activation patterns in the penultimate layer of the network. OpenMax allows rejection of “fooling” and unrelated open set images presented to the system; OpenMax greatly reduces the number of obvious errors made by a deep network. We prove that the OpenMax concept provides bounded open space risk, thereby formally providing an open set recognition solution. We evaluate the resulting open set deep networks using pre-trained networks from the Caffe Model-zoo on ImageNet 2012 validation data, and thousands of fooling and open set images. The proposed OpenMax model significantly outperforms open set recognition accuracy of basic deep networks as well as deep networks with thresholding of SoftMax probabilities.},
1236  doi = {10.1109/CVPR.2016.173},
1237  url = {},
1238  isbn = {978-1-4673-8851-1},
1239  title = {Towards open set deep networks},
1240  address = {Las Vegas, NV, USA},
1244  pages = {1893--1902},
1245  year = {2015},
1246  month = {June},
1247  author = {Bendale, Abhijit and Boult, Terrance},
1248  publisher = {IEEE},
1249  booktitle = {{IEEE} {Conference} on {Computer} {Vision} and {Pattern} {Recognition} ({CVPR})},
1250  urldate = {2022-10-11},
1251  language = {en},
1252  abstract = {With the of advent rich classification models and high computational power visual recognition systems have found many operational applications. Recognition in the real world poses multiple challenges that are not apparent in controlled lab environments. The datasets are dynamic and novel categories must be continuously detected and then added. At prediction time, a trained system has to deal with myriad unseen categories. Operational systems require minimal downtime, even to learn. To handle these operational issues, we present the problem of Open World Recognition and formally define it. We prove that thresholding sums of monotonically decreasing functions of distances in linearly transformed feature space can balance “open space risk” and empirical risk. Our theory extends existing algorithms for open world recognition. We present a protocol for evaluation of open world recognition systems. We present the Nearest Non-Outlier (NNO) algorithm that evolves model efficiently, adding object categories incrementally while detecting outliers and managing open space risk. We perform experiments on the ImageNet dataset with 1.2M+ images to validate the effectiveness of our method on large scale visual recognition tasks. NNO consistently yields superior results on open world recognition.},
1253  doi = {10.1109/CVPR.2015.7298799},
1254  url = {},
1255  isbn = {978-1-4673-6964-0},
1256  title = {Towards open world recognition},
1257  address = {Boston, MA, USA},
1261  pages = {10},
1262  year = {2017},
1263  author = {Arpit, Devansh and Jastrzebski, Stanisław and Ballas, Nicolas and Krueger, David and Bengio, Emmanuel and Kanwal, Maxinder S and Maharaj, Tegan and Fischer, Asja and Courville, Aaron and Bengio, Yoshua and Lacoste-Julien, Simon},
1264  booktitle = {International {Conference} on {Machine} {Learning} ({ICML})},
1265  language = {en},
1266  abstract = {We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.},
1267  title = {A closer look at memorization in deep networks},
1271  pages = {10},
1272  year = {2018},
1273  author = {Morcos, Ari and Raghu, Maithra and Bengio, Samy},
1274  booktitle = {Advances in {Neural} {Information} {Processing} {Systems} ({NeurIPS})},
1275  language = {en},
1276  abstract = {Comparing different neural network representations and determining how representations evolve over time remain challenging open questions in our understanding of the function of neural networks. Comparing representations in neural networks is fundamentally difficult as the structure of representations varies greatly, even across groups of networks trained on identical tasks, and over the course of training. Here, we develop projection weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building off of SVCCA, a recently proposed method [22]. We first improve the core method, showing how to differentiate between signal and noise, and then apply this technique to compare across a group of CNNs, demonstrating that networks which generalize converge to more similar representations than networks which memorize, that wider networks converge to more similar solutions than narrow networks, and that trained networks with identical topology but different learning rates converge to distinct clusters with diverse representations. We also investigate the representational dynamics of RNNs, across both training and sequential timesteps, finding that RNNs converge in a bottom-up pattern over the course of training and that the hidden state is highly variable over the course of a sequence, even when accounting for linear transforms. Together, these results provide new insights into the function of CNNs and RNNs, and demonstrate the utility of using CCA to understand representations.},
1277  title = {Insights on representational similarity in neural networks with canonical correlation},
1281  pages = {12},
1282  year = {2019},
1283  author = {Ovadia, Yaniv and Fertig, Emily and Ren, Jie and Nado, Zachary and Sculley, D and Nowozin, Sebastian and Dillon, Joshua and Lakshminarayanan, Balaji and Snoek, Jasper},
1284  booktitle = {Advances in {Neural} {Information} {Processing} {Systems} ({NeurIPS})},
1285  language = {en},
arXiv:2302.07872v1 [cs.CY]
License: cc-by-sa-4.0
