GSoC 2021: Automatic Identification and Classification of Contract Data Types with NLP

This summer, I joined the Google Summer of Code (GSoC) program and contributed to the Accord Project.

My project is Automatic Identification and Classification of Contract Data Types with NLP. The goal is to automatically identify and classify data types in natural-language contract text, such as monetary amounts, dates, and legal-specific terms like agreement parties, so that they can be used in a Smart Legal Contract. As a result, users will be able to create a smart legal contract automatically from natural-language contract text.

This post is a high-level summary of my progress and experience.

How did I do it?

This section describes the key components of my solution. Source materials are available in my GitHub repository.

A. Data Mapping using NER by RoBERTa.

I built a scalable Named Entity Recognition (NER) model with Adapter-Transformers based on RoBERTa. This diagram illustrates how an NER model works:

For a more complex input, we would get a result like:

The model also has Active Learning pipelines. Users can define their own custom label (a data type such as Product), upload sample data, and then retrain the NER model. After retraining, the model will recognize their new label.
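To make the NER output more concrete, here is a minimal sketch of how token-level predictions can be merged into entity spans. It assumes the model emits BIO tags (B- for the first token of an entity, I- for a continuation, O for no entity), which is a common NER convention but not something confirmed for this project. The function name `group_entities`, the label names PARTY, MONEY, and DATE, and the sample sentence are all hypothetical.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (label, text) entity spans."""
    entities: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Store the entity collected so far, if there is one.
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # A B- tag (or an I- tag with a different label) opens a new entity.
            flush()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-"):
            # Continuation of the entity currently being collected.
            current_tokens.append(token)
        else:
            # An O tag closes any open entity.
            flush()
            current_label = None
            current_tokens = []
    flush()
    return entities


# Hypothetical sentence and tags, for illustration only.
tokens = ["Acme", "Corp", "shall", "pay", "USD", "500", "by", "1", "June", "2021"]
tags = ["B-PARTY", "I-PARTY", "O", "O", "B-MONEY", "I-MONEY", "O",
        "B-DATE", "I-DATE", "I-DATE"]
entities = group_entities(tokens, tags)
print(entities)
```

With this input, the sketch yields one PARTY, one MONEY, and one DATE span. A custom label such as Product would pass through the same grouping step once the model has been retrained to predict it.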

B. Template Suggestion by the Classification Model

When users first upload their natural language contract, the Natural Language Processing (NLP) classification model tells them which smart legal contract template is closest to the raw text, so they can easily use or fine-tune the Concerto contract model.
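The idea of finding the "closest" template can be illustrated with a deliberately simple stand-in. The sketch below is not the trained classification model from the project: it swaps in bag-of-words cosine similarity so that it runs with the standard library alone. The template names, template descriptions, and the `closest_template` helper are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphabetic words in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_template(contract_text: str, templates: Dict[str, str]) -> str:
    """Return the name of the template whose description is most similar."""
    query = bag_of_words(contract_text)
    return max(templates, key=lambda name: cosine(query, bag_of_words(templates[name])))


# Hypothetical templates, for illustration only.
templates = {
    "late-delivery": "penalty for late delivery of goods by the seller to the buyer",
    "rental-deposit": "tenant pays a deposit to the landlord for the rented property",
}
text = "The seller pays a penalty if delivery of the goods is late."
suggestion = closest_template(text, templates)
print(suggestion)
```

A trained classifier replaces the word-count comparison with learned representations, but the interface stays the same: raw contract text goes in, and the name of the best-matching template comes out.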

C. Identification of contract variables by the BERT QA Model

From input contract text, the NLP model will answer questions about the contract.
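BERT-style QA models are typically extractive: for each token they output a score for being the start of the answer and a score for being the end, and the answer is the span with the best combined score. The sketch below shows only that span-selection step, assuming this project's model works the same way. The scores are made up for illustration, and `best_span` and `max_len` are hypothetical names.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_len: int = 8) -> Tuple[int, int]:
    """Pick the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Question: "Who is the buyer?" (token scores below are invented, not model output)
tokens = ["The", "buyer", "is", "Acme", "Corp", "."]
start_scores = [0.1, 0.3, 0.0, 2.5, 0.4, -1.0]
end_scores = [0.0, 0.2, 0.1, 0.6, 2.8, -0.5]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Asking one question per contract variable (for example the buyer, the amount, or the deadline) turns the selected answers into candidate values for those variables.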

D. API backend.

Users can also call the NLP model via API. I documented the API with an OpenAPI definition and plenty of examples.

Experience

My favorite part of participating in GSoC was that I stepped out of my comfort zone and learned a lot from it. I worked with people from different countries. My mentor, Niall, lives in the UK, and other members of the Accord Project come from the USA and Australia. All of them are professional and accountable.

I discussed NLP and AI problems with Niall and his colleagues Walter and Jianbo. They gave me many critical suggestions about how to build the model, and guided me through state-of-the-art tools such as Adapter and Snorkel.

Furthermore, Michael, the Accord UX designer, helped me imagine how people will use my model. Matt, the PM, taught me how to design excellent APIs and backends. Other people in Accord, such as Jerome and Dan, gave me meaningful feedback on my mockup.

Last but not least, I improved my English by interacting with them online. My confidence in English grew this summer through listening and talking at GSoC-related meetings. The benefits I gained have carried over into other parts of my life. For example, my boss at my internship complimented me on becoming more accountable and professional recently.

The most challenging part of participating in GSoC was the time difference. I live in Taiwan, which is in the UTC+8 timezone; my mentor is at UTC+0, and other members of the Accord Project live across Europe, the USA, and Australia. Arranging a meeting time was therefore difficult. Even though we used Doodle and Google Calendar invitations to handle the timezone conversion, the meeting time was still very challenging in some cases. For example, the Working Group of the Accord Project has a Zoom meeting every week at 4 AM Taiwan time, so I needed to wake up early to attend (or watch the recording afterward). I learned to pre-record my presentations and let the host play them, then answer questions afterward, just in case I overslept or felt drowsy at the meeting time. Thankfully, I never overslept, and my mind was clear enough to answer the questions people asked during my sessions. The weekly meeting with my mentor, Niall, was at a more acceptable time: we met on Mondays at 6:30 PM Taiwan time, which is 10:30 AM in the UK.

Nevertheless, I cultivated good morning habits by waking up early. Now I usually go to sleep before 9 PM and wake up before 5 AM. I cherish this biological clock because it keeps me healthy, and my brain works best at 5 AM. I usually write bug-free code in the morning and create many problems if I stay up to code at night!

Thank you all for this summer! Thanks to Niall and his colleagues Walter and Jianbo, and to all of the Accord Project contributors.

I sincerely hope we can work together on the Accord Project in the future so that the NLP model can lower the barrier for users who want to use Smart Legal Contracts!