This post was originally published in Insurance CIO Outlook

Ask an account manager to cross out the parts of ACORD 125 they don't find relevant and you'll get a lot of red ink. Unfortunately, the ACORD forms, along with IVANS downloads, are the closest thing we have to an industry-standard data representation in the insurance space. Machine learning methods thrive on well-structured, high-quality data, and in retail insurance such data are in short supply.

At Newfront we are focused on how we can best represent the coverage offered by our carriers and operational information regarding our clients. We use these data to build a better experience for our clients when they transact and manage insurance.

Our concerns as a brokerage are quite different from those on the underwriting side, where data models are more mature and innovation comes from acquiring and using novel sources of data to make better underwriting decisions. Technical challenges in our domain include automatically generating a comprehensive summary of insurance for a client and determining the appropriate marketing strategy for a client based on their risk characteristics and historical carrier behavior.

We've broken down the world of insurance data into two large buckets: coverage provided by carriers, and the business operations of our insureds. In both cases we've built technology to help gather and structure this information.

Start with coverage data. The source of truth for these data lies in the quote and policy documents generated by carriers. Typically, account managers must read these documents and enter the coverage and pricing information they contain into an agency management system. We've built a tool that automatically parses PDF documents to extract structured coverage and pricing data, obviating the manual interpretation and data entry steps of quote and policy ingestion.

We debated two approaches to building this form of automation: a rule-based system, and one based on natural language processing and machine learning methods. Our goal is to minimize the time and effort our account managers spend reviewing the extracted data, and we do that by maximizing our ability to estimate the accuracy of each extraction. Rule-based systems perform well on samples that match their target representation (e.g., Hartford workers compensation quotes) and typically quite poorly on samples outside that target (a notable exception is when carriers share underlying quote templates). Machine learning systems do a better job generalizing outside their training corpus, but provide no guarantees on the accuracy of any given extraction.

To put it another way, across an entire document corpus a rule-based approach may have only 40% total accuracy, but near 100% accuracy on the documents from which it does extract a significant amount of data. In contrast, a machine learning approach could have 60% accuracy across the corpus, but you'd have no guarantee of the accuracy of extraction from any given document. The rule-based accuracy profile is preferable for us because we can separate our quotes and policies into two tracks: those that require little to no human review, and those that require full human review. Because the machine learning approach lacks a strong per-document accuracy guarantee, it would always need a significant human review component.
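A minimal sketch of how such a two-track split might look. It assumes that the share of expected fields a rule set manages to fill is a reasonable proxy for "a significant amount of data". The `Extraction` type, the field names, and the 0.9 threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    """Hypothetical container for the output of a rule set on one document."""
    document_id: str
    fields: dict  # field name -> extracted string, or None when no rule matched


def coverage(extraction, expected_fields):
    """Fraction of expected fields for which a rule produced a value."""
    if not expected_fields:
        raise ValueError("expected_fields must not be empty")
    found = sum(
        1 for name in expected_fields
        if extraction.fields.get(name) is not None
    )
    return found / len(expected_fields)


def route(extraction, expected_fields, threshold=0.9):
    """Send well-covered documents to light review, everything else to full review.

    The threshold is an illustrative assumption, not a recommended value.
    """
    if coverage(extraction, expected_fields) >= threshold:
        return "light_review"
    return "full_review"
```

A document whose rules filled every expected field lands in the light-review track. One where most rules failed to match falls back to full human review, just as it would without any automation.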

We enhance our data extraction process with validation. We know the data types of the fields we're trying to parse, and can flag if there's a type mismatch (alphabetic characters where we expect a dollar amount, or a dollar amount where we expect a date). As for the rules themselves, our rules engine enables a variety of extraction methods, including regular expressions, spatial instructions (extract all text in this bounding box on page 2), and inline custom functions.
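As a rough illustration of regular-expression rules combined with type validation, the sketch below pairs each field with a pattern and a parser that raises on a type mismatch. The field names, label text, patterns, and date format are assumptions made for the example and do not describe the actual rules engine. Spatial rules are omitted because they depend on a PDF layout library.

```python
import re
from datetime import datetime


def parse_dollars(text):
    """Return a float for strings like '$1,250.00'; raise ValueError otherwise."""
    cleaned = text.strip()
    if not re.fullmatch(r"\$?(?:\d{1,3}(?:,\d{3})*|\d+)(?:\.\d{2})?", cleaned):
        raise ValueError(f"not a dollar amount: {text!r}")
    return float(cleaned.lstrip("$").replace(",", ""))


def parse_date(text):
    """Return a date for MM/DD/YYYY strings; strptime raises ValueError otherwise."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


# Hypothetical rules: field name -> (regex with one capture group, typed parser).
RULES = {
    "premium": (r"Total Premium:\s*(\S+)", parse_dollars),
    "effective_date": (r"Effective Date:\s*(\S+)", parse_date),
}


def extract(text, rules):
    """Apply each rule; return parsed values plus flags for misses and mismatches."""
    values, flags = {}, []
    for name, (pattern, parser) in rules.items():
        match = re.search(pattern, text)
        if match is None:
            flags.append((name, "no match"))
            continue
        try:
            values[name] = parser(match.group(1))
        except ValueError:
            flags.append((name, "type mismatch"))
    return values, flags
```

Given the text `Total Premium: TBD`, the premium rule matches but the parser rejects the value, so the field is flagged for review and no bad value is stored.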

The second bucket of data we have worked to develop a representation for is operational information regarding our clients. Here the challenge is its breadth and diversity. As a restaurant you may be asked about the details of your UL300 suppression system maintenance and what percentage of your sales is alcohol, whereas as a construction company you'll be asked about personal protective equipment and the percentage of work subcontracted.

In the process of digitizing hundreds of carrier applications and supplementals we've built a database with tens of thousands of questions that carriers ask of our clients. This creates a deduplication problem: we don't want to separately represent the questions "What year did you incorporate your business?" and "Year of business incorporation."

To help our application digitization team avoid such duplication, we've used NLP methods to detect and highlight potential dupes. We do common preprocessing (tokenization, removing stop words, and stemming) followed by tf-idf weighting to surface existing questions whose wording is similar to that of a given question. The digitization team then has the option to reuse the existing question rather than creating a new one specific to the application they are working on.
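The pipeline described above can be sketched with the standard library alone. Several pieces here are assumptions made for illustration: the tiny stop-word list, the crude suffix-stripping `stem` function (a stand-in for a real stemmer such as Porter's), the smoothed idf formula, and the use of cosine similarity to rank candidates, which the text does not specify.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"what", "did", "do", "you", "your", "of", "the", "a", "is", "in"}


def stem(token):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ation", "ate", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf so that no term receives a zero weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def similar_questions(new_question, existing, top_k=3):
    """Rank existing questions by tf-idf cosine similarity to a new wording."""
    vectors = tfidf_vectors(existing + [new_question])
    query = vectors[-1]
    scored = [(cosine(query, v), q) for v, q in zip(vectors, existing)]
    return sorted(scored, reverse=True)[:top_k]
```

With this preprocessing, "Year of business incorporation." and "What year did you incorporate your business?" reduce to the same three stems, so the existing question ranks first and can be offered for reuse.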

One of the things that most excited me when I joined Newfront was the massive data representation challenge that insurance presents. We need ontologies that accurately reflect the operations of every business in every industry, and all the possible coverages provided by carriers. In building these ontologies we will unlock our ability to train sophisticated machine learning models on top of our newly structured data.

Josh Lewis joined Newfront as an early engineer and now, after working in product and marketing roles, runs Newfront's Portland, OR engineering office. Before Newfront, Josh led product at Alpine Data (acquired by TIBCO) and was an engineering manager at Ayasdi. Josh holds a PhD in Cognitive Science from UC San Diego, where he studied how people interact with and interpret unsupervised machine learning algorithms.