PyPremise

PyPremise allows you to easily identify patterns or explanations of where a machine learning classifier performs well and where it fails. It is independent of any specific classifier or architecture. It has been evaluated both on NLP text tasks and on data with arbitrary binary features.

For a recent Visual Question Answering model, for example, it identifies that the model struggles with counting, visual orientation and higher reasoning questions:

pattern                                        | example from the dataset
UNK                                            | how are the UNK covered
(how, many)                                    | how many elephants are there
(what, ⓧ(color, colors, colour))               | what color is the bench
(on, top, of)                                  | what is on the top of the cake
(left, to)                                     | what can be seen to the left
(on, wall, hanging)                            | what is hanging on the wall
(how, does, look)                              | how does the woman look
(what, does, ⓧ(say, like, think, know, want))  | what does the sign say

For more examples, you can check out the original publication of Premise.

Example usage

This repository provides an easy-to-use Python interface so that you can run Premise with just a few lines of code.

from pypremise import Premise, data_loaders
premise_instances, _, voc_index_to_token = data_loaders.get_dummy_data()
premise = Premise(voc_index_to_token=voc_index_to_token)
patterns = premise.find_patterns(premise_instances)
for p in patterns:
    print(p)
# prints two patterns
# (How) and (many) towards misclassification
# (When) and (was) and (taken) towards correct classification

If you are working on text data, you can also use word embeddings to improve the results:

embedding, dim = data_loaders.create_fasttext_mapping("/path/to/fasttext.bin", voc_index_to_token)
premise = Premise(voc_index_to_token=voc_index_to_token, embedding_index_to_vector=embedding, 
                  embedding_dimensionality=dim, max_neighbor_distance=1)
patterns = premise.find_patterns(premise_instances)
# finds the additional pattern
# (When) and (was) and (taken) and (photo-or-photograph) 

PyPremise provides you with helper methods to load data from different sources like numpy arrays or tokenized text files. See below for more examples.

Installation

Install a recent version of Python, then just run

pip install pypremise

Currently, only Linux (Ubuntu) is supported in this way. If you want to use PyPremise on a different platform like macOS, compile the original Premise for your platform and replace the Premise executable file in the pypremise directory here with the one you compiled. You can then run pip install . to install PyPremise. If you run into any issues, just contact us.

If you want to use FastText embeddings (optional), please install them following these instructions and then download embeddings for your language here (the .bin file is needed).
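
As a quick sanity check (a minimal sketch assuming the official fasttext Python package and a downloaded .bin file; adjust the path to your download), you can verify that the embeddings load:

import fasttext

# Load the downloaded .bin model and print its vector dimensionality.
model = fasttext.load_model("/path/to/fasttext.bin")
print(model.get_dimension())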

If you are looking for the Additional Material of the Premise paper, you can find it here.

Example Usage & Documentation

General Usage

In general, you run your classifier on some data (e.g. the development set) and log where the model predicts correctly and where it misclassifies. Then you give this information to PyPremise and it finds patterns of misclassification and correct classification for you.

# Check if the model classifies correctly or not on your dataset
# Adapt this code to your specific setting
premise_features = [instance.x for instance in dev_set]
premise_labels = []
for instance in dev_set:
    if model.predict(instance.x) == instance.true_label:
        premise_labels.append(1) # correct classification
    else:
        premise_labels.append(0) # misclassification

# Convert your data to the PyPremise format. Various helper methods exist;
# here we use from_sparse_index_lists() as an example. Pick the one that
# fits your data.
premise_instances = data_loaders.from_sparse_index_lists(premise_features, premise_labels)

# run Premise
premise_patterns = Premise().find_patterns(premise_instances)
for pattern in premise_patterns:
    print(pattern)

data_loaders contains various methods to load data and convert it into the format expected by PyPremise. We will now give a couple of examples. For the full documentation of the methods and the full list of helper methods, please check the documentation in pypremise/data_loaders.py.

For NLP / for text from files

You can load the data from files. Create one file for the features and one for the labels. Each line in a file represents one instance, so the feature file and the label file must have the same number of lines.

In the feature file, put your text, whitespace tokenized. In the label file, put the premise label (0 if the instance was misclassified, 1 if it was correctly classified by your classifier). This could look like:

features.txt

a brown dog .
a black cat .

labels.txt

1
0

You can load this data with

from pypremise import data_loaders
premise_instances, voc_token_to_index, voc_index_to_token = data_loaders.from_tokenized_file("features.txt", "labels.txt", delimiter=" ")

Premise works internally with indices (numbers), not tokens. You can convert from indices to tokens and vice versa with the voc_index_to_token and voc_token_to_index mappings. If you give the mapping to Premise, it will convert the patterns automatically for you.

from pypremise import Premise
premise = Premise(voc_index_to_token=voc_index_to_token)
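
If you want to inspect tokens or indices yourself, and assuming both mappings behave like plain dicts (a hypothetical lookup based on how they are used here, not part of the documented API), the conversion is a direct lookup:

# Hypothetical manual lookups, assuming dict-like mappings.
index = voc_token_to_index["dog"]   # token -> index
token = voc_index_to_token[index]   # index -> token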

For NLP / for text from lists

Instead of writing the data to files, you can also use lists directly. Based on the previous example, this would look like:

features = [["a", "brown", "dog", "."], ["a", "black", "cat", "."]]
labels = [1, 0]
premise_instances, voc_token_to_index, voc_index_to_token = data_loaders.from_token_lists(features, labels)

The rest stays the same.

For NLP: using FastText Word Embeddings

You can use word embeddings to get more interesting rules, like ('photo' or 'photograph' or 'picture'). For FastText embeddings, this is already implemented. Just add the FastText lookup and tell Premise to use it. Increasing max_neighbor_distance lets Premise look for more complex patterns, but it will also increase the runtime.

embedding, dim = data_loaders.create_fasttext_mapping("/path/to/fasttext.bin", voc_index_to_token)
premise = Premise(voc_index_to_token=voc_index_to_token, embedding_index_to_vector=embedding, 
                  embedding_dimensionality=dim, max_neighbor_distance=2)

For NLP: using other Word Embeddings

You can also use any other word embeddings of your choice. You just need to provide the following to Premise (a small sketch follows the list):

  • embedding_index_to_vector: a mapping from an index to its corresponding vector/embedding representation. You can look up the index of a token in voc_token_to_index. The embedding representation is just a list of numbers.
  • embedding_dimensionality: the dimensionality of the embedding vectors.
  • max_neighbor_distance: how many neighbors to look at; a number greater than 0.
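
As a minimal sketch (the random vectors and the 50-dimensional size are made up purely for illustration; replace them with vectors from your embedding model, e.g. GloVe or word2vec), wiring in a custom embedding lookup could look like this:

import random
from pypremise import Premise, data_loaders

# Dummy instances just to have a vocabulary; use your own data here.
premise_instances, _, voc_index_to_token = data_loaders.get_dummy_data()

embedding_dim = 50
# Assumption: a dict from vocabulary index to a list of floats, mirroring
# what create_fasttext_mapping returns. Random vectors stand in for real ones.
embedding_index_to_vector = {
    index: [random.random() for _ in range(embedding_dim)]
    for index in voc_index_to_token
}

premise = Premise(voc_index_to_token=voc_index_to_token,
                  embedding_index_to_vector=embedding_index_to_vector,
                  embedding_dimensionality=embedding_dim,
                  max_neighbor_distance=1)
patterns = premise.find_patterns(premise_instances)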

For arbitrary machine learning data and data mining

PyPremise can work with dense and sparse matrix representations, either passed directly or loaded from files. Check out the documentation in pypremise.data_loaders for the following methods; a small sketch follows the list:

  • from_sparse_index_lists
  • from_dense_index_matrix
  • from_csv_sparse_index_file
  • from_csv_dense_index_file
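
As a small sketch (the feature matrix and labels below are made up; only from_sparse_index_lists, whose call already appears in the General Usage example, is used here, so check pypremise/data_loaders.py for the exact signatures of the other helpers), binary feature data can be handed to Premise like this:

import numpy as np
from pypremise import Premise, data_loaders

# Three instances with five binary features each (made-up data).
features = np.array([[1, 0, 1, 0, 0],
                     [0, 1, 1, 0, 1],
                     [1, 0, 0, 1, 0]])
# 1 = correctly classified, 0 = misclassified by your model.
labels = [1, 0, 1]

# Represent each instance by the indices of its active features and
# pass the result to the sparse-list loader.
sparse_features = [[int(i) for i in np.nonzero(row)[0]] for row in features]
premise_instances = data_loaders.from_sparse_index_lists(sparse_features, labels)

for pattern in Premise().find_patterns(premise_instances):
    print(pattern)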

License & Citation

If you use this tool in your work, we would be happy to hear about it!

Also, please cite our work as

@inproceedings{premise,
  author    = {Michael A. Hedderich and
               Jonas Fischer and
               Dietrich Klakow and
               Jilles Vreeken},
  title     = {Label-Descriptive Patterns and Their Application to Characterizing
               Classification Errors},
  booktitle = {International Conference on Machine Learning, {ICML}},
  series    = {Proceedings of Machine Learning Research},
  year      = {2022},
  url       = {https://proceedings.mlr.press/v162/hedderich22a.html}
}

PyPremise (not Premise itself) is published under the MIT license.

Contact and Help

If you run into any issues, feel free to contact us (email in the paper) or create an issue on GitHub. We are happy to help you out!
