Releases: LexPredict/lexpredict-lexnlp

corpus/uspto-sample/0.1

07 Mar 15:25
330b4e1
Pre-release

United States Patent and Trademark Office (USPTO) Dataset

Date (ISO 8601): 2022-04-11

The USPTO backgrounds were downloaded using a derivative of this script:
https://github.com/EleutherAI/pile-uspto

This sample contains 4,500 text files distributed evenly across 45 directories (100 files per directory). Each text file contains the text of a USPTO application background and is placed in the directory for the grant's year of issue. The texts were randomly selected from the subset of backgrounds that are 2,000 or more characters long.
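The selection described above can be sketched as follows. This is a minimal illustration, not the script used to build the sample: the function name, the `backgrounds_by_year` mapping, and the fixed seed are assumptions, and the per-year quota of 100 is inferred from 4,500 files spread evenly over 45 directories.

```python
import random


def sample_backgrounds_by_year(backgrounds_by_year, per_year=100, min_chars=2000, seed=0):
    """Pick up to `per_year` sufficiently long backgrounds for each year.

    Hypothetical helper; `backgrounds_by_year` maps a year to a list of texts.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = {}
    for year, texts in sorted(backgrounds_by_year.items()):
        # Keep only backgrounds of at least `min_chars` characters.
        eligible = [text for text in texts if len(text) >= min_chars]
        # Draw without replacement; take everything if too few are eligible.
        sample[year] = rng.sample(eligible, min(per_year, len(eligible)))
    return sample


# Toy input: two made-up years with a mix of long and short texts.
toy = {
    1976: ["x" * 2500] * 5 + ["short"] * 5,
    1977: ["y" * 2000] * 3,
}
picked = sample_backgrounds_by_year(toy, per_year=2)
print({year: len(texts) for year, texts in picked.items()})
```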

corpus/sec-edgar-forms-3-4-5-8k-10k-sample/0.1

07 Mar 15:23
330b4e1

SEC EDGAR Forms 3, 4, 5, 8-K, 10-K

Date (ISO 8601): 2022-04-19

A sample of SEC EDGAR forms from OpenEDGAR stored in plaintext.

Form | Count
3 | 198
4 | 198
5 | 200
8-K | 197
10-K | 199

corpus/govinfo-fr-2021/0.1

07 Mar 15:28
330b4e1
Pre-release

GovInfo Federal Register (2021)

Date (ISO 8601): 2022-04-11

Extracted from: https://www.govinfo.gov/bulkdata/FR/2021

Converted to text using Apache Tika.

corpus/eurlex-sample-10000/0.1

07 Mar 15:29
330b4e1
Pre-release

EUR-Lex Document Sample (10,000)

Date (ISO 8601): 2022-04-16

This dataset contains 10,000 EUR-Lex documents downloaded via http://api.epdb.eu/.

  • 5,000 of these documents contain at least one occurrence of the substring "agreement" (case-insensitive).
  • 5,000 of these documents contain no occurrence of the substring "agreement" (case-insensitive).
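The split rests on a case-insensitive substring test, which can be sketched as below. The function names and sample strings are illustrative assumptions. Because the check is a substring match rather than a whole-word match, a word such as "disagreement" also counts.

```python
def contains_agreement(text):
    """Return True if "agreement" occurs anywhere in the text, ignoring case."""
    return "agreement" in text.lower()


def split_by_agreement(documents):
    """Partition documents into (with substring, without substring)."""
    with_term = [doc for doc in documents if contains_agreement(doc)]
    without_term = [doc for doc in documents if not contains_agreement(doc)]
    return with_term, without_term


# Made-up examples, not taken from the dataset.
docs = [
    "Framework AGREEMENT between the parties",
    "Council regulation on fisheries",
    "A disagreement arose during the procedure",
]
with_term, without_term = split_by_agreement(docs)
print(len(with_term), len(without_term))
```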

Important excerpts from EUR-Lex's copyright notice are quoted below:

The Commission’s document reuse policy is based on Decision 2011/833/EU. Unless otherwise specified, you can re-use the legal documents published in EUR-Lex for commercial or non-commercial purposes.

The copyright for the editorial content of this website, the summaries of EU legislation and the consolidated texts, which is owned by the EU, is licensed under the Creative Commons Attribution 4.0 International licence.

corpus/contract-types/0.1

07 Mar 15:31
330b4e1
Pre-release

This dataset contains 2387 text files from SEC EDGAR, each with "agreement" in its file name. The documents have been sorted into the following categories:

  • ADVISORY AGREEMENT
  • AGENCY AGREEMENT
  • ARBITRATION AGREEMENT
  • ASSIGNMENT AGREEMENT
  • ASSUMPTION AGREEMENT
  • COLLABORATION AGREEMENT
  • CONFIDENTIALITY AGREEMENT
  • CONTRIBUTION AGREEMENT
  • DEALER AGREEMENT
  • DEPOSIT AGREEMENT
  • DEVELOPMENT AGREEMENT
  • DISTRIBUTION AGREEMENT
  • EMPLOYMENT AGREEMENT
  • ENTITY STRUCTURE
  • ESCROW AGREEMENT
  • EXCHANGE AGREEMENT
  • FEE WAIVER AGREEMENT
  • FRANCHISE AGREEMENT
  • FUND ACCOUNTING AGREEMENT
  • INDEMNIFICATION AGREEMENT
  • INTERCREDITOR AGREEMENT
  • INVESTMENT AGREEMENT
  • JOINT FILING AGREEMENT
  • LEASE AGREEMENT
  • LICENSE AGREEMENT
  • LOAN AGREEMENT
  • MANAGEMENT AGREEMENT
  • MANUFACTURING AGREEMENT
  • MERGER & ACQUISITION AGREEMENT
  • NON-DISCLOSURE AGREEMENT
  • NOT A CONTRACT
  • OPERATING AGREEMENT
  • OTHER CONTRACT
  • PLEDGE AGREEMENT
  • PROMISSORY NOTE
  • REGISTRATION RIGHTS AGREEMENT
  • REPURCHASE AGREEMENTS
  • SALES CONTRACT
  • SECURITIES SALES
  • SECURITY AGREEMENT
  • SERVICES AGREEMENT
  • SERVICING AGREEMENT
  • SETTLEMENT AGREEMENT
  • STOCK OPTION AGREEMENT
  • SUBORDINATION AGREEMENT
  • SUPPLY AGREEMENT
  • TAX ALLOCATION AGREEMENT
  • TRUST AGREEMENT
  • UNDERWRITING AGREEMENT
  • WAIVER AGREEMENT
  • WARRANT AGREEMENT

corpus/caselaw-access-project-ark-ill-nc-nm-subset-144million-characters/0.1

07 Mar 15:34
330b4e1

Caselaw Access Project

Randomly-selected subset

Date (ISO 8601): 2022-04-15

This dataset is a partial redistribution of the case_text_open data available from the Caselaw Access Project.

Specifically, this dataset contains a subset of the files from the original Caselaw Access Project dataset. Files were drawn at random from the original data until the subset reached a total of approximately 144 million characters, not counting newlines or spaces. This was done to approximately match the character length of a different dataset.
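A draw of this kind can be sketched as follows, assuming the texts fit in memory as a list of strings. The function names, the seed, and the stop-once-the-target-is-reached rule are assumptions; the original procedure is not documented beyond the description above.

```python
import random


def count_chars(text):
    """Count characters, excluding spaces and newlines."""
    return sum(1 for ch in text if ch not in (" ", "\n"))


def draw_until_budget(texts, target_chars, seed=0):
    """Select texts in random order until the running total reaches the target.

    Returns the chosen indices and the total character count reached.
    """
    rng = random.Random(seed)
    order = list(range(len(texts)))
    rng.shuffle(order)  # random order without replacement
    chosen, total = [], 0
    for index in order:
        if total >= target_chars:
            break
        chosen.append(index)
        total += count_chars(texts[index])
    return chosen, total


# Toy corpus: each text has 3 countable characters.
toy_texts = ["a b\nc"] * 10
chosen, total = draw_until_budget(toy_texts, target_chars=7)
print(len(chosen), total)
```

Because the last file drawn can overshoot the target, the total is only approximately equal to the budget.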

Permission to redistribute is implicitly included on Caselaw Access Project's "About" page, under Usage & access:

Thus far, Illinois, Arkansas, New Mexico, and North Carolina have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction.

This data was downloaded from the Caselaw Access Project in April 2021.

corpus/bonds/0.1

07 Mar 15:36
330b4e1
Pre-release
Merge pull request #71 from LexPredict/2.3.0

2.3.0

corpus/atticus-cuad-v1-plaintext/0.1

07 Mar 15:38
330b4e1
Pre-release

The Atticus Project: CUAD v1 Dataset (plaintext only)

Date (ISO 8601): 2022-04-16

This is a partial redistribution of The Atticus Project's CUAD v1 dataset of 510 labeled contracts.

Unlike in the original dataset, the plaintext documents have been organized into their respective contract type categories.

The original dataset is licensed under CC BY 4.0

Notes:

  • The file ADUROBIOTECH,INC_06_02_2020-EX-10.7-CONSULTING AGREEMENT.txt is duplicated as ADUROBIOTECH,INC_06_02_2020-EX-10.7-CONSULTING AGREEMENT(1).txt in both this redistribution and the original dataset.
  • In the original dataset, the file HarpoonTherapeuticsInc_20200312_10-K_EX-10.18_12051356_EX-10.18_Development Agreement.txt has a corresponding PDF named HarpoonTherapeuticsInc_20200312_10-K_EX-10.18_12051356_EX-10.18_Development Agreement_Option Agreement.pdf
  • In the original dataset, the file NETGEAR,INC_04_21_2003-EX-10.16-AMENDMENT TO THE DISTRIBUTOR AGREEMENT BETWEEN INGRAM MICRO AND NETGEAR.txt has a corresponding PDF named NETGEAR,INC_04_21_2003-EX-10.16-AMENDMENT TO THE DISTRIBUTOR AGREEMENT BETWEEN INGRAM MICRO AND NETGEAR-.pdf

corpus/arxiv-abstracts-with-agreement/0.1

07 Mar 15:39
330b4e1

ArXiv Dataset Abstract Subsample

Only abstracts containing the substring "agreement"

Date (ISO 8601): 2022-04-15

This dataset contains 69,411 plaintext files, each corresponding to an ArXiv document abstract. Each abstract contains at least one appearance of the substring "agreement".

Each text file in this dataset contains the text of an abstract extracted from the full JSON Lines-formatted dataset (described below). Each file is named after its ArXiv ID and has the .txt file extension. Where an ArXiv ID contains a forward slash (/), the slash is replaced with an underscore (_). The text files have a median length of 1057 characters and a mean length of 1100 characters.
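The naming rule can be expressed in one line. The helper name below is an assumption for illustration, and the IDs are arbitrary examples in the old and new ArXiv formats.

```python
def arxiv_id_to_filename(arxiv_id):
    """Map an ArXiv ID to its text file name: "/" becomes "_", plus ".txt"."""
    return arxiv_id.replace("/", "_") + ".txt"


# Old-style IDs contain a slash; new-style IDs do not.
print(arxiv_id_to_filename("hep-th/9901001"))
print(arxiv_id_to_filename("0704.0001"))
```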

The full ArXiv metadata dataset can be found on Kaggle and includes additional information alongside each abstract, such as document authors, comments, DOI, etc. The original dataset was distributed under the CC0: Public Domain license, thereby permitting this modification and redistribution.

pipeline/is-contract/0.1

25 Apr 16:10
Pre-release

Scikit-Learn Pipeline

Name | Class | State
transformerpreprocessor | TransformerPreprocessor | head_character_n=2000, normalizer=<lexnlp.ml.normalizers.Normalizer object>
transformervectorizer | TransformerVectorizer | vectorizers=(<lexnlp.ml.vectorizers.VectorizerDoc2Vec object>, <lexnlp.ml.vectorizers.VectorizerKeywordSearch object>)
minmaxscaler | MinMaxScaler | feature_range=(-1.0, 1.0)
gaussiannb | GaussianNB |

Training data

Dataset | Description | Hyperlink
corpus/uspto-sample/0.1 | A sample of patent grant backgrounds from the United States Patent and Trademark Office | https://github.com/EleutherAI/pile-uspto
corpus/govinfo-fr-2021/0.1 | United States Federal Register, 2021 | https://www.govinfo.gov/bulkdata/FR/2021
corpus/contract-types/0.1 | A sample of labeled contract types obtained from SEC EDGAR | https://www.sec.gov/edgar.shtml
corpus/bonds/0.1 | A sample of municipal bonds | ?
corpus/caselaw-access-project-ark-ill-nc-nm-subset-144million-characters/0.1 | Caselaw Access Project; official, book-published state case law from Arkansas, Illinois, North Carolina, New Mexico | https://case.law/download/bulk_exports/latest/by_jurisdiction/case_text_open/
corpus/atticus-cuad-v1-plaintext/0.1 | Atticus CUAD v1 contracts | https://www.atticusprojectai.org/cuad
corpus/eurlex-sample-10000/0.1 | EUR-Lex documents downloaded via api.epdb.eu | https://eur-lex.europa.eu/ http://api.epdb.eu/
corpus/arxiv-abstracts-with-agreement/0.1 | ArXiv abstracts containing "agreement" | https://www.kaggle.com/datasets/Cornell-University/arxiv
corpus/sec-edgar-forms-3-4-5-8k-10k-sample/0.1 | Assorted SEC EDGAR filings | https://www.sec.gov/edgar.shtml

Metrics

              precision    recall  f1-score   support

       False       1.00      1.00      1.00     20652
        True       0.85      0.91      0.88       580

    accuracy                           0.99     21232
   macro avg       0.93      0.95      0.94     21232
weighted avg       0.99      0.99      0.99     21232

Confusion matrix: true (vertical) vs. predicted (horizontal)

       0      1
0  20562     90
1     50    530
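The figures reported for the True class can be re-derived from the confusion matrix above. The short check below uses only the four counts shown; the variable names are illustrative.

```python
# Counts from the confusion matrix (rows are true labels, columns are predictions).
tn, fp = 20562, 90   # true label 0
fn, tp = 50, 530     # true label 1

precision = tp / (tp + fp)                          # 530 / 620
recall = tp / (tp + fn)                             # 530 / 580
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 21092 / 21232

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.85 recall=0.91 f1=0.88 accuracy=0.99
```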

Usage

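import cloudpickle
from sklearn.pipeline import Pipeline
# Import path assumed from the LexNLP 2.3.0 package layout; adjust if it differs.
from lexnlp.extract.en.contracts.predictors import ProbabilityPredictorIsContract

# Load the pickled Scikit-Learn pipeline attached to this release.
with open('pipeline_is_contract_classifier.cloudpickle', 'rb') as f:
    pipeline_is_contract_classifier: Pipeline = cloudpickle.load(f)

# Wrap the pipeline in LexNLP's probability predictor.
probability_predictor_is_contract: ProbabilityPredictorIsContract = \
    ProbabilityPredictorIsContract(pipeline=pipeline_is_contract_classifier)

# Classify a document using a 0.5 probability threshold.
probability_predictor_is_contract.is_contract(
    text='...',
    min_probability=0.5,
    return_probability=True,
)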