This project is related to the pre-print paper: "ChatGPT may Pass the Bar Exam soon, but has a Long Way to Go for the LexGLUE benchmark" by Chalkidis (2023).
Following the work "GPT Takes the Bar Exam" by Bommarito and Katz (2023), we evaluate OpenAI's latest GPT-3.5 model, gpt-3.5-turbo (v. March 2023), the first publicly available ChatGPT model, on legal text classification tasks from the LexGLUE benchmark in a zero-shot fashion, providing examples in a templated instruction-following format similar to those used in "Scaling Instruction-Finetuned Language Models" (Chung et al., 2022). We find that ChatGPT achieves an average micro-F1 score of 49.0% across LexGLUE tasks in a zero-shot setting, significantly exceeding the baseline guessing rates, while performing exceptionally well on some datasets, reaching micro-F1 scores of 62.8% and 70.1% on the ECtHR B and LEDGAR datasets, respectively.
If you mention or build on top of this work, please cite:
@article{chalkidis-2023-lexglue-chatgpt,
  title={ChatGPT may Pass the Bar Exam soon, but has a Long Way to Go for the LexGLUE benchmark},
  author={Ilias Chalkidis},
  journal={SSRN},
  url={https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4385460},
  year={2023},
}
We evaluate GPT-3.5 models in a zero-shot learning setup by providing LexGLUE examples in an instruction-following format.
Example from UNFAIR-ToS:
Given the following sentence from an online Term of Services:
"if you are a resident of the european union (eu), please note that we offer this alternative
dispute resolution process, but we can not offer you the european commission dispute platform
as we do not have an establishment in the eu."
The sentence is unfair with respect to some of the following options:
- Limitation of liability
- Unilateral termination
- Unilateral change
- Content removal
- Contract by using
- Choice of law
- Jurisdiction
- Arbitration
- None
The relevant options are: [None]
Example from LEDGAR:
Given the following contractual section:
"The corporate law of the State of Delaware shall govern all issues and questions concerning the relative rights of the Corporation and its stockholders.
All other issues and questions concerning the construction, validity, interpretation and enforcement of this Agreement and the exhibits and schedules hereto
shall be governed by, and construed in accordance with, the laws of the State of New York, without giving effect to any choice of law or conflict of law rules
or provisions (whether of the State of New York or any other jurisdiction) that would cause the application of the laws of any
jurisdiction other than the State of New York."
There is an appropriate section title out of the following options:
- Adjustments
- Agreements
- Amendments
- Anti-Corruption Laws
- Applicable Laws
- Approvals
- Arbitration
- Assignments
- Assigns
...
- Governing Laws
- Waiver Of Jury Trials
- Waivers
- Warranties
The most appropriate option is: [Governing Laws]
For all LexGLUE tasks, we provide code to generate a templated instruction-following version of each dataset based on the following template:
INTRODUCTORY_PART
"[Input Text]"
OPTIONS_PRESENTATION_TEXT
- [Label 1]
- [Label 2]
...
- [Label L]
QUESTION_TEXT
You can alter the following parts (UNFAIR-ToS defaults shown):
INTRODUCTORY_PART = 'Given the following sentence from an online Term of Services:'
OPTIONS_PRESENTATION_TEXT = 'The sentence is unfair with respect to some of the following options:'
QUESTION_TEXT = 'The relevant options are:'
to generate different templated prompts by updating build_instructions/TEMPLATES.py.
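For illustration, here is a minimal sketch of what such a template dictionary and prompt builder could look like; the TEMPLATES mapping and build_instruction() helper below are hypothetical names for illustration, not necessarily the exact structure of build_instructions/TEMPLATES.py:

```python
# A minimal sketch of templated prompt construction. The TEMPLATES dict and
# build_instruction() helper are illustrative, not the repository's exact API.
TEMPLATES = {
    "unfair_tos": {
        "INTRODUCTORY_PART": "Given the following sentence from an online Term of Services:",
        "OPTIONS_PRESENTATION_TEXT": "The sentence is unfair with respect to some of the following options:",
        "QUESTION_TEXT": "The relevant options are:",
    },
}


def build_instruction(dataset_name: str, input_text: str, labels: list[str]) -> str:
    """Fill a dataset's template with the input text and its label options."""
    template = TEMPLATES[dataset_name]
    options = "\n".join(f"- {label}" for label in labels)
    return (
        f'{template["INTRODUCTORY_PART"]}\n'
        f'"{input_text}"\n'
        f'{template["OPTIONS_PRESENTATION_TEXT"]}\n'
        f"{options}\n"
        f'{template["QUESTION_TEXT"]}'
    )
```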
You have to follow three easy steps:
- To generate a templated instruction-following version of a dataset, e.g., EURLEX, you have to call the related script:
python build_instructions/eurlex_instructions.py
- To call the OpenAI API and get responses (predictions), you have to call the following script (a sketch of the underlying request appears after this list):
python call_openai.py --dataset_name eurlex --model_name gpt-3.5-turbo
Please provide an OpenAI API key: [Copy+Paste API key from https://platform.openai.com/account/api-keys]
You may find a list of available models in OpenAI's documentation; so far, we provide results for gpt-3.5-turbo. The usage cost per dataset is reported in the table below.
- To evaluate the performance of a GPT-3.5 model, you have to call the following script:
python evaluate_perfomance.py --dataset_name eurlex --model_name gpt-3.5-turbo --multi_label true
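For reference, the request issued in the second step boils down to a single chat completion call. Here is a minimal sketch using the openai package as of March 2023 (v0.27 interface), reusing the hypothetical build_instruction() helper sketched above; the message framing and temperature are assumptions, not necessarily what call_openai.py does:

```python
# Sketch of a chat completion request with the March-2023 openai package (v0.27);
# the prompt content and temperature choice are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # from https://platform.openai.com/account/api-keys

prompt = build_instruction(
    "unfair_tos",
    "if you are a resident of the european union (eu), please note that ...",
    ["Limitation of liability", "Unilateral termination", "None"],
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic decoding makes label parsing more reliable
)
prediction = response["choices"][0]["message"]["content"]
print(prediction)
```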
Dataset Name | Usage Cost |
---|---|
ECtHR A | $4.43 |
ECtHR B | $4.43 |
SCOTUS | $8.49 |
EURLEX | $6.15 |
LEDGAR | $1.34 |
UNFAIR-ToS | $0.41 |
CASEHOLD | $0.99 |
Total | $26.24 |
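These costs scale with the number of tokens processed. As a rough sketch, you can estimate the input-side cost upfront with OpenAI's tiktoken tokenizer; the price constant below is an assumption based on gpt-3.5-turbo pricing as of March 2023 and may change:

```python
# Rough upfront cost estimate for a set of prompts; the price constant reflects
# gpt-3.5-turbo pricing as of March 2023 ($0.002 per 1K tokens), an assumption.
import tiktoken

PRICE_PER_1K_TOKENS = 0.002  # USD


def estimate_cost(prompts: list[str], model_name: str = "gpt-3.5-turbo") -> float:
    """Return the estimated input-side cost in USD (completions add a bit more)."""
    encoding = tiktoken.encoding_for_model(model_name)
    total_tokens = sum(len(encoding.encode(prompt)) for prompt in prompts)
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS
```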
Following Chalkidis et al. (2022), we report micro- and macro-F1 scores:
Dataset Name | Zero-shot (Baseline Prompts - gpt-3.5-turbo) | Supervised (Chalkidis et al., 2022 - LegalBERT) |
---|---|---|
ECtHR A | 55.3 / 50.6 | 70.0 / 64.0 |
ECtHR B | 62.8 / 55.3 | 80.4 / 74.7 |
SCOTUS | 43.8 / 42.0 | 76.4 / 66.5 |
EURLEX | 32.5 / 21.1 | 72.1 / 57.4 |
LEDGAR | 70.1 / 56.7 | 88.2 / 83.0 |
UNFAIR-ToS | 41.4 / 22.2 | 96.0 / 83.0 |
CASEHOLD | 59.3 / 59.3 | 75.3 / 75.3 |
Average | 49.0 / 37.1 | 78.9 / 70.8 |
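For reference, the micro- and macro-F1 scores above can be computed with scikit-learn; a minimal sketch for a multi-label task, where the tiny y_true / y_pred matrices are purely illustrative:

```python
# Sketch of micro-/macro-F1 computation with scikit-learn on a multi-label task;
# y_true / y_pred are tiny illustrative label-indicator matrices.
from sklearn.metrics import f1_score

y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]  # gold labels, one column per class
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]  # labels parsed from model responses

micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"micro-F1: {100 * micro_f1:.1f} / macro-F1: {100 * macro_f1:.1f}")
```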