This repository contains the code for the paper:
Baby's CoThought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models.
In this work, we apply our "CoThought" pipeline to pretrain a Baby Language Model (BabyLM) on a small, human-like corpus.
The pretraining data is provided by Warstadt et al. (2023) in the framework of the BabyLM Challenge, which has the goal of sample-efficient pretraining on a developmentally plausible corpus at a small human-like data scale.
- `CNLU-EG`: Contains the code for the Creative NLU-Example Generation (CNLU-EG).
- `pretrain`: Contains the code and instructions for pretraining the RoBERTa model.
- `eval`: Contains the code for the shared evaluation pipeline from Warstadt et al. (2023).
- Download `babylm_data` and run `./CNLU-EG/data/text/cat_data.py` to merge the training files; this produces the raw data for the next step:

```bash
cd ./babylm_data/babylm_100M
cat aochildes.train bnc_spoken.train cbt.train children_stories.train open_subtitles.train qed.train switchboard.train > merged_data.txt
python cat_data.py merged_data.txt text.txt
```
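The exact behavior of `cat_data.py` is defined in this repository; as a rough sketch, assuming it simply normalizes the concatenated file into one non-empty line per example, the step looks like this (the cleaning logic here is an assumption, not the script's actual implementation):

```python
# Hypothetical sketch of the merge/clean step; the real cat_data.py may differ.
import sys

def main(src: str, dst: str) -> None:
    """Read the concatenated training file and write a cleaned text file."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if line:  # drop empty lines left over from concatenation
                fout.write(line + "\n")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])  # e.g. merged_data.txt text.txt
```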
- Use LLMs to generate the new dataset consisting of NLU-Examples:

```bash
cd ./CNLU-EG/scripts/text
bash cot_sampling.sh
```
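The actual prompts and sampling logic live in `cot_sampling.sh` and the scripts it invokes. For intuition only, a generation loop of this kind might look like the following sketch; the OpenAI client, the model name, and the `PROMPT` template are all assumptions, not the repository's code:

```python
# Hypothetical sketch of a CNLU-EG generation call; cot_sampling.sh is the
# authoritative implementation in this repository.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder prompt; the paper's actual prompt is defined in CNLU-EG.
PROMPT = (
    "Group the following sentences into one coherent NLU task example and "
    "explain the reasoning step by step:\n{sentences}"
)

def generate_nlu_example(sentences: list[str], model: str = "gpt-3.5-turbo") -> str:
    """Ask an LLM to restructure raw corpus sentences into one NLU-Example."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(sentences="\n".join(sentences))}],
    )
    return response.choices[0].message.content
```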
- Pre-train the BabyLM with our generated dataset.
- The generated dataset can be downloaded from here.

```bash
cd ./pretrain
python RoBERTa.py RoBERTa_config.json
```
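`RoBERTa.py` reads its hyperparameters from `RoBERTa_config.json`. As a rough illustration of what such a pretraining script does, here is a minimal masked-language-modeling setup with Hugging Face `transformers`; the tokenizer choice, model sizes, file names, and training arguments below are placeholders, not the paper's settings:

```python
# Minimal MLM pretraining sketch; RoBERTa.py / RoBERTa_config.json define the
# actual configuration used in the paper.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Assumption: reuse the roberta-base tokenizer rather than training a new one.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
config = RobertaConfig(vocab_size=tokenizer.vocab_size)  # default model sizes
model = RobertaForMaskedLM(config)  # randomly initialized, trained from scratch

dataset = load_dataset("text", data_files={"train": "text.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="babylm-roberta", per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
```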
- Evaluate the trained BabyLM on the shared pipeline, hosted at this GitHub link.
- The public validation data is a blend of BLiMP and (Super)GLUE tasks; additional tasks are held out for the final evaluation of submitted models.
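For intuition, BLiMP consists of minimal pairs: a model passes an item if it assigns higher probability to the grammatical sentence. A sketch of pseudo-log-likelihood scoring for a masked LM follows; the checkpoint path `babylm-roberta` and the example pair are made up, and the shared pipeline's official metrics are computed by the eval repository, not by this code:

```python
# Hypothetical BLiMP-style minimal-pair scoring with a masked LM.
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("babylm-roberta")
model = RobertaForMaskedLM.from_pretrained("babylm-roberta").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum the log-probability of each token when it alone is masked."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# The model passes the pair if the grammatical sentence scores higher.
good, bad = "The cats sleep.", "The cats sleeps."
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))
```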
If you find the resources in this repository useful, please cite:
```bibtex
@inproceedings{zhang-etal-2023-babys,
    title = "Baby{'}s {C}o{T}hought: Leveraging Large Language Models for Enhanced Reasoning in Compact Models",
    author = {Zhang, Zheyu and
      Yang, Han and
      Ma, Bolei and
      R{\"u}gamer, David and
      Nie, Ercong},
    editor = "Warstadt, Alex and
      Mueller, Aaron and
      Choshen, Leshem and
      Wilcox, Ethan and
      Zhuang, Chengxu and
      Ciro, Juan and
      Mosquera, Rafael and
      Paranjabe, Bhargavi and
      Williams, Adina and
      Linzen, Tal and
      Cotterell, Ryan",
    booktitle = "Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.conll-babylm.13",
    pages = "130--142",
}
```