Home
atarashi ~ 新しい (atarashii, Japanese for "new")
The main goal is to detect one (or, later, more) license files within a text file. For this, a collection of license text files exists.
The idea of detection is to determine the particular licensing of a file; merely detecting that some license is present is not enough.
Therefore we have a text file `f` and a set of licenses `L`, containing license texts `l_i`. (Markdown cannot render subscripts, hence the `l_i` notation.)
In order to determine license relevant texts, and also particular licenses, there will be
- a collection `W` of all words found in all `l_i ∈ L` (for all `l_i` that are elements of `L`)
- a collection `V` of all words used in normal English language, taken from data material found on the Internet
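Building the word collection `W` can be sketched as follows. This is a minimal illustration; the license snippets are hypothetical placeholders for the real `l_i` texts, and the tokenization rule is an assumption:

```python
import re

def words(text):
    """Tokenize a text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical license snippets standing in for the real l_i texts.
licenses = {
    "MIT": "Permission is hereby granted, free of charge ...",
    "GPL-2.0": "This program is free software; you can redistribute it ...",
}

# W: all words found in all l_i of L
W = set()
for text in licenses.values():
    W.update(words(text))

print(sorted(W))
```

The collection `V` of normal English words would be built the same way from the reference corpus.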
The following steps are then applied to a file, comparing it with all license files:
- For a given `f`, a so-called frequency of words is calculated.
- For this frequency, a score is calculated in a loop over all `l_i` of `L`. Considering every word found, meaning to loop over all words:
  - Take the minimum of two occurrence counts: first the occurrence in the file, second the occurrence in the actual `l_i` of `L`.
  - Then a TF-IDF coefficient (see below how to calculate the weight) is applied to that element. This results in a value.
  - Summing up all values results in a score for the relation of `f` and the actual `l_i`.
- All scores form a list; then the highest score refers to the best matching `l_i`.
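The scoring steps above can be sketched in Python. This is a minimal sketch, not atarashi's actual implementation; the word lists and idf weights in the example are made up for illustration:

```python
from collections import Counter

def score(file_words, license_words, idf):
    """Score the relation of a file f to one license text l_i:
    for every word, take the minimum of its occurrence count in f
    and in l_i, weight it with the word's tf-idf coefficient, and
    sum up all the weighted values."""
    f_freq = Counter(file_words)
    l_freq = Counter(license_words)
    return sum(min(f_freq[w], l_freq[w]) * idf.get(w, 0.0) for w in f_freq)

def best_match(file_words, licenses, idf):
    """Compute a score per license l_i in L and return the
    (name, score) pair with the highest score."""
    scores = {name: score(file_words, lw, idf)
              for name, lw in licenses.items()}
    return max(scores.items(), key=lambda kv: kv[1])

# Toy inputs; the word lists and idf weights are made up for illustration.
licenses = {
    "MIT": ["permission", "granted", "free", "charge"],
    "GPL": ["free", "software", "redistribute", "modify"],
}
idf = {"permission": 2.0, "granted": 2.0, "charge": 2.0,
       "redistribute": 2.0, "modify": 2.0, "software": 1.0, "free": 0.5}
f = ["permission", "granted", "free", "of", "charge"]
print(best_match(f, licenses, idf))  # → ('MIT', 6.5)
```

Taking the minimum of the two counts means a word only contributes as often as it appears in both the file and the license text.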
We consider a coefficient that applies a weight to the individual words:
- Is the word license relevant in general, as opposed to normal English language (or, generally, the normal context of a software distribution)?
- Is it relevant just for a particular license text?

The coefficient is then greater than zero, indicating a generally license relevant word. It will be a high value for a word that matches a particular license text (e.g. `info-zip`) and a low one for popular terms in the specific domain of licensing (e.g. `distribution`).
For the actual calculation of this coefficient, the term frequency is required; but at the same time, for the recognition of a particular license text, the inverse document frequency is required to distinguish license specific terms from general license text terms. Thus, the coefficient is calculated using the tf-idf statistic (https://en.wikipedia.org/wiki/Tf–idf).
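One common tf-idf variant can be sketched as follows (a minimal sketch; the exact tf and idf formulas used by atarashi may differ, and the tiny corpus in the example is made up):

```python
import math
from collections import Counter

def tf_idf(licenses):
    """For every license text, weight each word by
    tf  = relative frequency of the word within that text, times
    idf = log(N / number of license texts containing the word),
    where N is the number of license texts."""
    N = len(licenses)
    df = Counter()                      # document frequency per word
    for words in licenses.values():
        df.update(set(words))
    weights = {}
    for name, words in licenses.items():
        counts = Counter(words)
        weights[name] = {w: (c / len(words)) * math.log(N / df[w])
                         for w, c in counts.items()}
    return weights

# Tiny made-up corpus: a term shared by all texts gets weight 0,
# a term unique to one text gets the highest weight.
weights = tf_idf({
    "A": ["info", "zip", "distribution"],
    "B": ["gnu", "distribution"],
})
print(weights["A"]["distribution"], round(weights["A"]["info"], 3))  # → 0.0 0.231
```

This matches the intent above: a word appearing in every license text (like `distribution`) receives weight zero, while a word unique to one license (like `info-zip`) receives the highest weight.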
- Open question: would we need normalization, as common in text mining, e.g. making the statistics independent of document length? This could be good or unfortunate, because the length of a license text also represents a characteristic.