This document describes how to reproduce the results in the paper.
The dataset is constructed as follows: first, the label distribution of all the data is counted; then an appropriate amount of data is sampled in equal proportions per label to form a temporary training, validation, and test set with a 5:1:4 ratio. The training set is then downsampled so that every label has the same number of training examples.
Five different datasets are generated with different random seeds to realize five runs, corresponding to `random_index` 0 to 4.
For the case of a 500-sample test set, the procedure is the same except that all splits are scaled down proportionally; this corresponds to `random_index` 6.
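The following is a minimal sketch of this split-and-downsample procedure, assuming the data sit in a pandas DataFrame with a single label column; the actual preprocessing script in this repository may differ in details, and the function and column names here are illustrative:
```python
# Hypothetical sketch of the stratified 5:1:4 split plus training-set
# downsampling described above; the real preprocessing may differ.
import pandas as pd

def make_split(df: pd.DataFrame, label_col: str, seed: int):
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby(label_col):
        group = group.sample(frac=1.0, random_state=seed)  # shuffle within label
        n = len(group)
        n_train, n_val = int(n * 0.5), int(n * 0.1)        # 5:1:4 split
        train_parts.append(group.iloc[:n_train])
        val_parts.append(group.iloc[n_train:n_train + n_val])
        test_parts.append(group.iloc[n_train + n_val:])
    train = pd.concat(train_parts)
    # Downsample the training set so every label has the same count.
    n_min = train[label_col].value_counts().min()
    train = (train.groupby(label_col, group_keys=False)
                  .apply(lambda g: g.sample(n=n_min, random_state=seed)))
    return train, pd.concat(val_parts), pd.concat(test_parts)
```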
The detailed data volumes are shown in the tables below:
- Length-of-Stay prediction, random_index 0 to 4
label | 1 | 2 | 3 |
---|---|---|---|
train | 2980 | 2980 | 2980 |
val | 1200 | 596 | 425 |
test | 2400 | 1192 | 851 |
- Mortality prediction, random_index 0 to 4
label | 0 | 1 |
---|---|---|
train | 2100 | 2100 |
val | 2273 | 300 |
test | 4546 | 600 |
all | 22731 | 3000 |
- Readmission prediction, random_index 0 to 4
label | 0 | 1 |
---|---|---|
train | 277 | 277 |
val | 500 | 40 |
test | 1000 | 79 |
all | 5000 | 396 |
- Length-of-Stay prediction, random_index 6
label | 1 | 2 | 3 |
---|---|---|---|
train | 335 | 335 | 335 |
val | 135 | 67 | 48 |
test | 270 | 134 | 96 |
- Mortality prediction, random_index 6
label | 0 | 1 |
---|---|---|
train | 204 | 204 |
val | 221 | 29 |
test | 442 | 58 |
- Readmission prediction, random_index 6
label | 0 | 1 |
---|---|---|
train | 128 | 128 |
val | 232 | 18 |
test | 463 | 37 |
Table 1 reports the results of five runs (`random_index` 0 to 4) for both the LLMs and the traditional ML models.
For traditional ML models (taking mortality prediction as an example):
```bash
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 0
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 1
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 2
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 3
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 4
```
For LLMs (taking Llama3-Instruct on mortality prediction as an example):
```bash
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 1
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 2
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 3
python test_withprob.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 4
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_{random_index}`.
Use `calculate.py` to calculate the F1 and AUROC results, then compute the 95% confidence intervals over the five runs to obtain the results in Table 1.
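As a sketch (assuming `calculate.py` yields one score per run; the exact output format is not specified here), the 95% confidence interval over the five runs can be computed like this:
```python
# Sketch: 95% confidence interval over five run-level scores.
import numpy as np
from scipy import stats

scores = np.array([0.71, 0.69, 0.72, 0.70, 0.73])  # e.g. AUROC from the 5 runs
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
# Two-sided 95% CI with a t-distribution (n - 1 degrees of freedom).
low, high = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(f"{mean:.3f} ({low:.3f}, {high:.3f})")
```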
For Table 2, due to time constraints, only one run (`random_index = 0`) is tested instead of the five runs used for Table 1.
For traditional ML models (taking mortality prediction as an example):
```bash
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 0
```
For LLMs (taking Llama3-Instruct on mortality prediction as an example):
```bash
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_0`.
Then use `calculate.py` to calculate the F1 and AUROC results.
Unlike Tables 1 and 2, the tests for Table 3 take much longer because of prompt engineering, so the test set size is limited to 500 samples with the same label proportions as the full data. That is, specify `random_index=6` when running the code.
For traditional ML models (taking mortality prediction as an example):
```bash
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 6
```
For LLMs (taking Llama3-Instruct on mortality prediction with the ICL method as an example):
```bash
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ICL \
    --random_index 6
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_6`.
For the MIMIC-IV results, the procedure is the same as for Table 1, but with the dataset changed to MIMIC-IV.
These tables report the results of training the traditional ML models with different proportions of the training set.
Take mortality prediction on the MIMIC-III dataset, trained with 40% of the training set, as an example:
```bash
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 0 \
    --ratio 0.4
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 1 \
    --ratio 0.4
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 2 \
    --ratio 0.4
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 3 \
    --ratio 0.4
python tradition.py \
    --task mortality_pred \
    --dataset mimic3 \
    --random_index 4 \
    --ratio 0.4
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_{random_index}_{ratio}`.
Use `calculate.py` to calculate the F1 and AUROC results, then compute the 95% confidence intervals to obtain the results in these tables.
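For illustration, the `--ratio` subsampling step might look like the sketch below. This is an assumption about the behavior of `tradition.py` (keeping the per-label balance of the downsampled training set); the actual implementation may differ:
```python
# Hypothetical sketch: keep a fraction `ratio` of the balanced training set,
# sampled per label so the classes stay balanced.
import pandas as pd

def subsample_train(train: pd.DataFrame, label_col: str,
                    ratio: float, seed: int) -> pd.DataFrame:
    return (train.groupby(label_col, group_keys=False)
                 .apply(lambda g: g.sample(frac=ratio, random_state=seed)))
```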
This figure reports the results of testing the LLMs with different temperatures.
Take Llama3-Instruct on the MIMIC-III dataset and the mortality prediction task as an example:
```bash
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 0.2
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 0.4
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 0.6
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 0.8
python test.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset mimic3 \
    --task mortality_pred \
    --mode ORI \
    --random_index 0 \
    --temperature 1
```
The results will be saved as `results/{task}/{dataset}/{task}_result_data_{model_name}_0_{temperature}.csv`.
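For convenience, the whole sweep can be driven from a short script like the following (not part of the repository; it simply invokes the same command as above for each temperature):
```python
# Convenience sketch: run the temperature sweep above in one go.
import subprocess

for temperature in [0.2, 0.4, 0.6, 0.8, 1.0]:
    subprocess.run([
        "python", "test.py",
        "--base_model", "meta-llama/Meta-Llama-3-8B-Instruct",
        "--dataset", "mimic3",
        "--task", "mortality_pred",
        "--mode", "ORI",
        "--random_index", "0",
        "--temperature", str(temperature),
    ], check=True)
```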
This figure shows the results of fine-tuning the LLMs. The training, validation, and test splits for fine-tuning are the same as in the `random_index=6` case above (test set size 500), to facilitate training and comparison with the previous results.
We use LLaMA-Factory to fine-tune the models. For more detail about the fine-tuning data, please refer to the appendix of the paper.
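As an illustration only: LLaMA-Factory accepts instruction-tuning data in the Alpaca JSON format, so the fine-tuning examples could be assembled roughly as below. The instruction wording, file name, and label encoding here are hypothetical; the actual fine-tuning data construction is described in the paper's appendix.
```python
# Hypothetical sketch: package (note, label) pairs as Alpaca-style records for
# LLaMA-Factory. The instruction text, file name, and label encoding are
# illustrative only; see the paper's appendix for the real fine-tuning data.
import json

samples = [("<clinical note text>", "0"), ("<clinical note text>", "1")]
records = [
    {
        "instruction": "Predict in-hospital mortality for this patient (answer 0 or 1).",
        "input": note,
        "output": label,
    }
    for note, label in samples
]
with open("mortality_sft.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```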