PyPremise makes it easy to identify patterns or explanations of where a machine learning classifier performs well and where it fails. It is independent of any specific classifier or architecture and has been evaluated both on NLP text tasks and on data with arbitrary binary features.
For a recent Visual Question Answering model, for example, it identifies that the model struggles with counting, visual orientation, and higher reasoning questions:
| pattern | example from the dataset |
|---|---|
| UNK | how are the UNK covered |
| (how, many) | how many elephants are there |
| (what, ⓧ(color, colors, colour)) | what color is the bench |
| (on, top, of) | what is on the top of the cake |
| (left, to) | what can be seen to the left |
| (on, wall, hanging) | what is hanging on the wall |
| (how, does, look) | how does the woman look |
| (what, does, ⓧ(say, like, think, know, want)) | what does the sign say |
For more examples, you can check out the original publication of Premise.
This repository provides an easy-to-use Python interface so that you can run Premise with just a few lines of code:
```python
from pypremise import Premise, data_loaders

premise_instances, _, voc_index_to_token = data_loaders.get_dummy_data()
premise = Premise(voc_index_to_token=voc_index_to_token)
patterns = premise.find_patterns(premise_instances)
for p in patterns:
    print(p)
# prints two patterns
# (How) and (many) towards misclassification
# (When) and (was) and (taken) towards correct classification
```
If you are working on text data, you can also use word embeddings to improve the results:
```python
embedding, dim = data_loaders.create_fasttext_mapping("/path/to/fasttext.bin", voc_index_to_token)
premise = Premise(voc_index_to_token=voc_index_to_token, embedding_index_to_vector=embedding,
                  embedding_dimensionality=dim, max_neighbor_distance=1)
patterns = premise.find_patterns(premise_instances)
# finds the additional pattern
# (When) and (was) and (taken) and (photo-or-photograph)
```
PyPremise provides you with helper methods to load data from different sources like numpy arrays or tokenized text files. See below for more examples.
Install a recent version of Python, then just run

```
pip install pypremise
```
Currently, only Linux (Ubuntu) is supported this way. If you want to use PyPremise on a different platform like macOS, compile the original Premise for your platform, replace the `Premise` executable file in the `pypremise` directory of this repository with the one you compiled, and then run `pip install .` to install PyPremise. If you run into any issues, just contact us.
If you want to use FastText embeddings (optional), please install FastText following these instructions and then download the embeddings for your language here (you need the .bin file).
If you are looking for the Additional Material of the Premise paper, you can find it here.
In general, you run your classifier on some data (e.g. the development set) and log where the model predicts correctly and where it misclassifies. You then give this information to PyPremise, and it finds patterns of misclassification and correct classification for you.
```python
from pypremise import Premise, data_loaders

# Check whether the model classifies each instance of your dataset correctly.
# Adapt this code to your specific setting.
premise_features = [instance.x for instance in dev_set]
premise_labels = []
for instance in dev_set:
    if model.predict(instance.x) == instance.true_label:
        premise_labels.append(1)  # correct classification
    else:
        premise_labels.append(0)  # misclassification

# Convert your data to the PyPremise format. Various helper methods exist;
# here, we use e.g. from_sparse_index_lists(). Pick the one
# that fits your data.
premise_instances = data_loaders.from_sparse_index_lists(premise_features, premise_labels)

# run Premise
premise_patterns = Premise().find_patterns(premise_instances)
for pattern in premise_patterns:
    print(pattern)
```
`data_loaders` contains various methods to load data and convert it into the format expected by PyPremise. We will now give a couple of examples. For the full documentation and the full list of helper methods, please check the documentation in `pypremise/data_loaders.py`.
You can load the data from files. Create one file for the features and one for the labels. Each line in a file represents one instance, so the feature and label files must have the same number of lines. In the feature file, put your text, whitespace tokenized. In the label file, put the Premise label (0 if the instance was misclassified, 1 if it was correctly classified by your classifier). This could look like:
features.txt:

```
a brown dog .
a black cat .
```

labels.txt:

```
1
0
```
You can load this data with

```python
from pypremise import data_loaders

premise_instances, voc_token_to_index, voc_index_to_token = data_loaders.from_tokenized_file("features.txt", "labels.txt", delimiter=" ")
```
Premise works internally with indices (numbers), not tokens. You can convert from indices to tokens and vice versa with the `voc_token_to_index` and `voc_index_to_token` mappings. If you give the mapping to Premise, it will convert the patterns automatically for you.
```python
from pypremise import Premise

premise = Premise(voc_index_to_token=voc_index_to_token)
```
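If you ever need to do the conversion yourself, here is a minimal sketch, assuming the two vocabularies support standard dictionary lookup (as returned by `from_tokenized_file` above):

```python
# voc_token_to_index and voc_index_to_token behave like plain
# dictionaries, so a manual conversion is a simple lookup:
tokens = ["a", "brown", "dog", "."]
indices = [voc_token_to_index[t] for t in tokens]
recovered = [voc_index_to_token[i] for i in indices]
assert recovered == tokens
```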
Instead of writing the data to files, you can also use lists directly. Based on the previous example, this would look like:
```python
features = [["a", "brown", "dog", "."], ["a", "black", "cat", "."]]
labels = [1, 0]
premise_instances, voc_token_to_index, voc_index_to_token = data_loaders.from_token_lists(features, labels)
```
The rest stays the same.
You can use word embeddings to get more interesting rules, like ('photo' or 'photograph' or 'picture'). For FastText embeddings, this is already implemented: just add the FastText lookup and tell Premise to use it. Increasing `max_neighbor_distance` lets it look for more complex patterns but also increases the runtime.
```python
embedding, dim = data_loaders.create_fasttext_mapping("/path/to/fasttext.bin", voc_index_to_token)
premise = Premise(voc_index_to_token=voc_index_to_token, embedding_index_to_vector=embedding,
                  embedding_dimensionality=dim, max_neighbor_distance=2)
```
You can also use any other word embeddings of your choice. You just need to provide Premise with the following (see the sketch after this list):

- `embedding_index_to_vector`: a mapping from an index to its corresponding vector/embedding representation. You can look up the index of a token in `voc_token_to_index`. The embedding representation is just a list of numbers.
- `embedding_dimensionality`: the dimensionality of the embedding vectors.
- `max_neighbor_distance`: how many neighbors to look at. A number > 0.
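As a minimal sketch of building such a lookup by hand, here we read a plain-text, GloVe-style embedding file ("token v1 v2 ... vn" per line); the file path and the zero-vector fallback for tokens without an embedding are our own assumptions, not part of the PyPremise API:

```python
# Hypothetical sketch: build embedding_index_to_vector from a
# whitespace-separated embedding file. Path is a placeholder.
word_vectors = {}
with open("/path/to/embeddings.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word_vectors[parts[0]] = [float(x) for x in parts[1:]]

dim = len(next(iter(word_vectors.values())))
zero_vector = [0.0] * dim  # assumed fallback for out-of-vocabulary tokens

embedding_index_to_vector = {
    index: word_vectors.get(token, zero_vector)
    for index, token in voc_index_to_token.items()
}

premise = Premise(voc_index_to_token=voc_index_to_token,
                  embedding_index_to_vector=embedding_index_to_vector,
                  embedding_dimensionality=dim,
                  max_neighbor_distance=1)
```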
PyPremise can work with dense and sparse matrix representations, both directly and from files. Just check out the documentation in `pypremise.data_loaders` for the methods

- `from_sparse_index_lists`
- `from_dense_index_matrix`
- `from_csv_sparse_index_file`
- `from_csv_dense_index_file`
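As a rough usage sketch for the dense case (the argument order of `from_dense_index_matrix` is an assumption by analogy to `from_sparse_index_lists`; check `pypremise/data_loaders.py` for the authoritative signature):

```python
import numpy as np
from pypremise import Premise, data_loaders

# Hypothetical example: each row is one instance with binary features,
# each label marks correct (1) or incorrect (0) classification.
features = np.array([[1, 0, 1, 0],
                     [0, 1, 1, 0],
                     [1, 1, 0, 1]])
labels = [1, 0, 1]

# Signature assumed by analogy to from_sparse_index_lists(features, labels).
premise_instances = data_loaders.from_dense_index_matrix(features, labels)
for pattern in Premise().find_patterns(premise_instances):
    print(pattern)
```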
If you use this tool in your work, we would be happy if you tell us about it!
Also, please cite our work as
```bibtex
@inproceedings{premise,
  author    = {Michael A. Hedderich and
               Jonas Fischer and
               Dietrich Klakow and
               Jilles Vreeken},
  title     = {Label-Descriptive Patterns and Their Application to Characterizing
               Classification Errors},
  booktitle = {International Conference on Machine Learning, {ICML}},
  series    = {Proceedings of Machine Learning Research},
  year      = {2022},
  url       = {https://proceedings.mlr.press/v162/hedderich22a.html}
}
```
PyPremise (not Premise itself) is published under the MIT license.
If you run into any issues, feel free to contact us (email in the paper) or create an issue on GitHub. We are happy to help you out!