Skip to content

Toolchain to retrieve and parse privacy policies from websites as described in our paper "Unifying Privacy Policy Detection" by Henry Hosseini, Martin Degeling, Christine Utz, and Thomas Hupperich. Details: https://petsymposium.org/popets/2021/popets-2021-0081.pdf

License

Notifications You must be signed in to change notification settings

ITSec-Uni-Muenster/Unifying-Privacy-Policy-Detection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Unifying Privacy Policy Detection Toolchain

This is the accompanying repository for the "Unifying Privacy Policy Detection" paper published in the Pri­va­cy En­han­cing Tech­no­lo­gies Sym­po­si­um (PETS) 2021.

The aim of this project is to support privacy policy researchers with a unified solution for creating privacy policy corpora based on currently available best-practices.

At the moment, we have uploaded the source code as a proof of concept, according with the trained classifiers and vectorizers in English and German. We are planning to provide a pip package as soon as possible in order to ease the application of this toolchain.

Explanation

The toolchain consists of five steps:

  1. Finding potential privacy/cookie policies on websites
  2. Text-from-HTML extraction
  3. Language detection
  4. Key phrase extraction
  5. Classification

Structure of the repository

The current structure of the repository is depicted as follows:

.
|-- LICENSE
|-- README.md
|-- privacy_policy_link_detection
|   |-- README.md
|   |-- custom_command_find_privacy_policies.py
|   `-- demo_privacy_policy_download.py
`-- privacy_policy_toolchain
    |-- code
    |   |-- ppt.py
    |   `-- resources
    |       |-- VotingClassifier_soft_de.pkl
    |       |-- VotingClassifier_soft_en.pkl
    |       |-- trained_vectorizer_de.pkl
    |       `-- trained_vectorizer_en.pkl
    |-- data
    |   `-- privacy_policies
    |-- environment.yml
    |-- feature_list
    |   |-- feature_list_de.txt
    |   `-- feature_list_en.txt
    |-- logs
    |   `-- language_analysis
    `-- results
        `-- classification

The folder resources contains the trained models and the vectorizers for both English and German.

Paper

Henry Hosseini, Martin Degeling, Christine Utz, Thomas Hupperich. "Unifying Privacy Policy Detection." PETS 2021.

Contact

About

Toolchain to retrieve and parse privacy policies from websites as described in our paper "Unifying Privacy Policy Detection" by Henry Hosseini, Martin Degeling, Christine Utz, and Thomas Hupperich. Details: https://petsymposium.org/popets/2021/popets-2021-0081.pdf

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published