This document describes how to reproduce the results in the paper.
The dataset is constructed as follows: first, the label distribution of all the data is counted; then an appropriate amount of data is sampled in equal proportions per label to form a temporary training, validation, and test set with a 5:1:4 ratio. The training set is then downsampled so that every label has the same number of training examples.
Five different datasets are generated with different random seeds to realize five runs, corresponding to `random_index` 0 to 4.
For the case of a 500-sample test set, the procedure is the same except that all splits are scaled down proportionally; this corresponds to `random_index` 6.
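The following is a minimal sketch of this split-and-downsample procedure, assuming the data sit in a pandas DataFrame with a single label column; the actual preprocessing script in this repository may differ in details, and the function and column names here are illustrative:
```python
# Hypothetical sketch of the stratified 5:1:4 split plus training-set
# downsampling described above; the real preprocessing may differ.
import pandas as pd

def make_split(df: pd.DataFrame, label_col: str, seed: int):
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby(label_col):
        group = group.sample(frac=1.0, random_state=seed)  # shuffle within label
        n = len(group)
        n_train, n_val = int(n * 0.5), int(n * 0.1)        # 5:1:4 split
        train_parts.append(group.iloc[:n_train])
        val_parts.append(group.iloc[n_train:n_train + n_val])
        test_parts.append(group.iloc[n_train + n_val:])
    train = pd.concat(train_parts)
    # Downsample the training set so every label has the same count.
    n_min = train[label_col].value_counts().min()
    train = (train.groupby(label_col, group_keys=False)
                  .apply(lambda g: g.sample(n=n_min, random_state=seed)))
    return train, pd.concat(val_parts), pd.concat(test_parts)
```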
The detailed data volumes are shown in the tables below:
- Length-of-Stay prediction, random_index 0 to 4
label | 1 | 2 | 3 |
---|---|---|---|
train | 2980 | 2980 | 2980 |
val | 1200 | 596 | 425 |
test | 2400 | 1192 | 851 |
- Mortality prediction, random_index 0 to 4
label | 0 | 1 |
---|---|---|
train | 2100 | 2100 |
val | 2273 | 300 |
test | 4546 | 600 |
all | 22731 | 3000 |
- Readmission prediction, random_index 0 to 4
label | 0 | 1 |
---|---|---|
train | 277 | 277 |
val | 500 | 40 |
test | 1000 | 79 |
all | 5000 | 396 |
- Length-of-Stay prediction, random_index 6
label | 1 | 2 | 3 |
---|---|---|---|
train | 335 | 335 | 335 |
val | 135 | 67 | 48 |
test | 270 | 134 | 96 |
- Mortality prediction, random_index 6
label | 0 | 1 |
---|---|---|
train | 204 | 204 |
val | 221 | 29 |
test | 442 | 58 |
- Readmission prediction, random_index 6
label | 0 | 1 |
---|---|---|
train | 128 | 128 |
val | 232 | 18 |
test | 463 | 37 |
Table 1 reports the results of five runs (`random_index` 0 to 4) for both the LLMs and the traditional ML models.
For traditional ML models (taking mortality prediction as an example):
```bash
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 0
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 1
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 2
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 3
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 4
```
For LLMs (taking Llama3-Instruct on mortality prediction as an example):
```bash
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 1
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 2
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 3
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 4
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_{random_index}`.
Use `calculate.py` to calculate the F1 and AUROC results, then compute the 95% confidence intervals over the five runs to obtain the results in Table 1.
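As a sketch (assuming `calculate.py` yields one score per run; the exact output format is not specified here), the 95% confidence interval over the five runs can be computed like this:
```python
# Sketch: 95% confidence interval over five run-level scores.
import numpy as np
from scipy import stats

scores = np.array([0.71, 0.69, 0.72, 0.70, 0.73])  # e.g. AUROC from the 5 runs
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
# Two-sided 95% CI with a t-distribution (n - 1 degrees of freedom).
low, high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"{mean:.3f} ({low:.3f}, {high:.3f})")
```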
For Table 2, due to time constraints, only one run (`random_index = 0`) is tested instead of the five runs used for Table 1.
For traditional ML models (taking mortality prediction as an example):
```bash
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 0
```
For LLMs (taking Llama3-Instruct on mortality prediction as an example):
```bash
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_0`.
Then use `calculate.py` to calculate the F1 and AUROC results.
Unlike Tables 1 and 2, the tests for Table 3 take much longer because of prompt engineering, so the test set size is limited to 500 samples with the same label proportions as the full data. That is, specify `random_index=6` when running the code.
For traditional ML models (taking mortality prediction as an example):
```bash
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 6
```
For LLMs (taking Llama3-Instruct on mortality prediction with the ICL method as an example):
```bash
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ICL \
    --random_index 6
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_6`.
For the MIMIC-IV results, the procedure is the same as for Table 1, but with the dataset changed to MIMIC-IV.
These tables report the results of training the traditional ML models with different proportions of the training set.
Take mortality prediction on the MIMIC-III dataset, trained with 40% of the training set, as an example:
```bash
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 0 \
    --ratio 0.4
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 1 \
    --ratio 0.4
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 2 \
    --ratio 0.4
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 3 \
    --ratio 0.4
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 4 \
    --ratio 0.4
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_{random_index}_{ratio}`.
Use `calculate.py` to calculate the F1 and AUROC results, then compute the 95% confidence intervals to obtain the results in these tables.
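For illustration, the `--ratio` subsampling step might look like the sketch below. This is an assumption about the behavior of `tradition.py` (keeping the per-label balance of the downsampled training set); the actual implementation may differ:
```python
# Hypothetical sketch: keep a fraction `ratio` of the balanced training set,
# sampled per label so the classes stay balanced.
import pandas as pd

def subsample_train(train: pd.DataFrame, label_col: str,
                    ratio: float, seed: int) -> pd.DataFrame:
    return (train.groupby(label_col, group_keys=False)
                 .apply(lambda g: g.sample(frac=ratio, random_state=seed)))
```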
This figure reports the results of testing the LLMs with different temperatures.
Take Llama3-Instruct on the MIMIC-III dataset and the mortality prediction task as an example:
```bash
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 0.2
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 0.4
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 0.6
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 0.8
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 1
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_0_{temperature}.csv`.
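For convenience, the whole sweep can be driven from a short script like the following (not part of the repository; it simply invokes the same command as above for each temperature):
```python
# Convenience sketch: run the temperature sweep above in one go.
import subprocess

for temperature in [0.2, 0.4, 0.6, 0.8, 1.0]:
    subprocess.run([
        "python", "test.py",
        "--base_model", "meta-llama/Meta-Llama-3-8B-Instruct",
        "--dataset", "mimic3",
        "--task", "mortality_pred",
        "--mode", "ORI",
        "--random_index", "0",
        "--temperature", str(temperature),
    ], check=True)
```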
This figure shows the results of fine-tuning the LLMs. The training, validation, and test splits for fine-tuning are the same as in the `random_index=6` case above (test set size 500), to facilitate training and comparison with the previous results.
We use LLaMA-Factory to fine-tune the models. For more detail about the fine-tuning data, please refer to the appendix of the paper.
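As an illustration only: LLaMA-Factory accepts instruction-tuning data in the Alpaca JSON format, so the fine-tuning examples could be assembled roughly as below. The instruction wording, file name, and label encoding here are hypothetical; the actual fine-tuning data construction is described in the paper's appendix.
```python
# Hypothetical sketch: package (note, label) pairs as Alpaca-style records for
# LLaMA-Factory. The instruction text, file name, and label encoding are
# illustrative only; see the paper's appendix for the real fine-tuning data.
import json

samples = [("<clinical note text>", "0"), ("<clinical note text>", "1")]
records = [
    {
        "instruction": "Predict in-hospital mortality for this patient (answer 0 or 1).",
        "input": note,
        "output": label,
    }
    for note, label in samples
]
with open("mortality_sft.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```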