Issues with the Availability and Quality of Training Sets in AI Contracting

Zach Frankel | February 12, 2021

A Primer on Training Sets and Machine Learning.  

Merging computer processing and the practice of transactional law is a concept that has been around longer than you might think. Technologies for automating contract management and drafting tasks existed as early as the 1970s, and consumer-facing software for automating tasks like incorporation and estate planning has been a fixture in the legal services market for well over a decade. Both the number and the utility of legal technology innovations used in contracting are growing rapidly, and much of that growth is being driven by machine learning technology.

Machine learning is a form of artificial intelligence in which algorithms “learn” from data and improve their performance without being explicitly reprogrammed. Training sets are the initial samples presented to the software, and for the purposes of AI contracting, training sets take the form of documents (contracts, forms, filings, etc.) provided to the software. The quantity of data included in a training set can affect both the software’s ability to produce new outputs and the accuracy of those outputs. In other words, the more contracts fed to machine learning software, the better it can generally learn to discover patterns in order to manage and classify contracts, analyze qualities such as risk, suggest changes, or even predict operative results from contract drafts. The more complex and variable the contract-related output, the larger the training set required for the software to function accurately.
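To make the idea of a training set concrete, here is a minimal sketch in Python. It is not how any commercial contracting product works; it is a toy word-count classifier, and the clause texts, the labels, and the function names (tokenize, train, classify) are all hypothetical and invented for illustration. It simply shows labeled clauses going in and a label for an unseen clause coming out.

```python
from collections import Counter, defaultdict

PUNCTUATION = ".,;:()\"'"

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def train(examples):
    """Count how often each word appears under each clause label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(tokenize(text))
    return counts

def classify(counts, text):
    """Pick the label whose training words overlap most with the new text."""
    words = tokenize(text)
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

# Hypothetical, hand-written training set (assumption: real sets are far larger).
TRAINING_SET = [
    ("Each party shall keep the disclosed information confidential.",
     "confidentiality"),
    ("The receiving party shall not disclose confidential information to third parties.",
     "confidentiality"),
    ("Either party may terminate this agreement upon thirty days written notice.",
     "termination"),
    ("This agreement shall terminate automatically upon material breach.",
     "termination"),
]

model = train(TRAINING_SET)
prediction = classify(model, "The supplier may terminate upon written notice of breach.")
print(prediction)
```

With only four training clauses, the classifier knows two clause types and a few dozen words. Covering more clause types, or more varied drafting styles, would require many more labeled examples, which is the quantity problem described above.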

The quality of data included in the training set also affects the software’s ability to perform its purpose. The phrase “garbage in, garbage out” captures the idea that the overall quality of the training set used to train the software will affect both its ability to accurately analyze problems and the overall quality of its outputs. Even setting aside judgments on the quality of the inputs, the form and content of outputs will often mirror the inputs, especially when dealing with textual data. In short, the value of contracting software will depend on the quality (including context and similarity) and quantity of training set data.

Issues with Contract Language.  

The momentum for another fascinating shift in contract drafting is building alongside the proliferation of artificial intelligence – a push for simplifying contract language. Legal tech companies are investing in adapting Natural Language Processing (NLP) techniques to more easily process the legalese and convoluted sentence structures commonly found in legal contracts. LawGeex, for instance, developed algorithms that can interpret unfamiliar legalese, and the company has reported that its product can perform certain contract review tasks more accurately than human lawyers. The need for this kind of interpretation creates barriers for companies that are not as well funded or cannot absorb the multi-year lead time required to train such systems.

While technical contract language alone presents issues for training set availability, interpreting complex legal language is not just an issue for software and machines. The digitalization of consumer-facing contracts has fueled a demand for simpler, more natural language of the kind customers use in their day-to-day lives. There is growing support for reducing the complexity of drafting language overall, and contracting organizations like World Commerce and Contracting have already issued guidance for implementing this shift. Ironically, this trend presents potential future issues for software built on current training sets that include convoluted language. As Professors Daniel W. Linna Jr. and Helena Haapio put it, “[W]e have a disconnect between people developing AI for contracting and people working to improve contracting through simplification and redesign.” This disconnect affects the quality of training sets: we might soon face a situation in which the legal language on which companies train their software no longer matches society’s expectations of contract language.

Issues with Availability and Quality of Contracts.

The availability of contracts on which to train software is important for any developer of new AI contracting software because machine learning processes can learn more from larger training sets. The issue of availability is a simple one: the vast majority of contracts are not public documents. Additionally, public databases that do exist, such as the SEC’s EDGAR database, relate mostly to specific practices and types of contracting parties. Still, LexPredict, Bloomberg and Contract Standards are among the many legal tech providers who utilize public databases to account for at least part of their training sets. Public filings are included in databases for reasons of disclosure rather than for their intrinsic value or paradigmatic drafting, and their use as training set data presents issues of quality.

Also problematic is the fact that public filings often represent only an end product. Especially for emerging predictive applications of AI contracting, AI needs to, first, learn which contract provisions are standard in order to create baseline precedent documents that can be customized and, second, gather information about the specific situations and conditions in which specific, non-standard customizations are to be applied. The emphasis on disclosure over process, and on the end product over the drafts that led to it, creates a paradox in which outcomes are transparent but process is opaque, creating yet another potential difficulty in utilizing public documents for training purposes.

Issues for Specific Practices and Contract Types.

There are both quality and quantity (availability) issues that affect the supply of satisfactory training sets for particular types of contracts. As previously stated, legal tech companies pre-train their software with large, often public, data sets. Software often will not properly recognize unfamiliar terms and contracts that are not part of the existing set, and so the software will be less useful for more niche or particularly complex contracts. Additionally, when developing newer predictive contracting technology, it will be difficult to train models to accurately predict outputs for circumstances that occur infrequently, such as “bet the company” situations. Finally, legal market conditions that arise within particular practice areas might present unique roadblocks to innovation, leading to fewer training sets being generated. In his paper on the inefficiency of precedent selection in the M&A field, Professor Robert Anderson IV notes that pressure to standardize agreements often comes from the client side, and firms are less likely to invest in standardizing deal documentation via technology in matters like bankruptcy or acquisitions where clients are unlikely to be repeat players. He also notes that a particularly small number of firms dominate the M&A field, and reputational barriers may exclude new and innovative firms from entering the marketplace and leveraging technology to challenge the status quo.
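The point about unfamiliar terms can be illustrated with a rough sketch in Python. It measures what share of the words in a new clause never appeared in the training documents. This is only one crude proxy for unfamiliarity, not a description of how any vendor’s software works, and the sample clauses and the function names (tokenize, vocabulary, unseen_ratio) are hypothetical.

```python
PUNCTUATION = ".,;:()\"'"

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def vocabulary(documents):
    """Collect every distinct word seen in the training documents."""
    return {word for doc in documents for word in tokenize(doc)}

def unseen_ratio(vocab, text):
    """Return the fraction of words in the text that are absent from the vocabulary."""
    words = tokenize(text)
    if not words:
        return 0.0
    unseen = [w for w in words if w not in vocab]
    return len(unseen) / len(words)

# Hypothetical training documents drawn from ordinary sale-of-goods language.
TRAINING_DOCS = [
    "The buyer shall pay the purchase price at closing.",
    "The seller shall deliver the goods to the buyer.",
]

vocab = vocabulary(TRAINING_DOCS)

# A clause that reuses familiar terms versus a more niche clause.
familiar = unseen_ratio(vocab, "The seller shall deliver the goods at closing.")
niche = unseen_ratio(vocab, "The licensee shall escrow the source code with a trustee.")
print(familiar, niche)
```

A high ratio for the niche clause suggests that software trained only on the sample documents has little basis for handling it, which mirrors the difficulty with specialized or complex contracts described above.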

Conclusion and Beyond.

This post briefly surveys a few contract-specific issues that might limit the quality and availability of training sets for AI contract drafting, but there are undoubtedly many more.  In addition to technological issues, other issues relate to the way law firms, other legal service providers, and clients behave.  In a future post, I hope to survey more of these, along with a couple of proposed solutions for issues with training sets for AI contracting.  Suggested solutions include affording greater IP protection to contracts and sharing contract management resources among firms.  These solutions, if implemented, may in turn pose new issues, such as ones involving cost-sharing and privacy concerns.  AI contracting, following the path of machine learning in a more general sense, is a rapidly advancing technology, and new issues with its implementation will likely continue to arise.  

Zach Frankel is a second-year law student at Northwestern Pritzker School of Law.