forked from ic-labs/django-icekit

Commit b62f3ce
Re ic-labs#285 porting importtools libs and functionality into GLAMkit.

Greg Turner committed Aug 24, 2017
1 parent: 1a17dec

Showing 31 changed files with 1,161 additions and 87 deletions.
GLAMkit comes with tools for analysing and harvesting from large collections of various file types, currently:

* JSON files
* XML files with unwieldy or undefined schemas (which may also be badly formed)
* MARC files which use an undocumented set of fields (which may also be badly formed)

Requirements
------------

The different data formats (except for JSON) use libraries that may not be installed by default. Alter the optional ``import_*`` extras in your project's ``requirements-icekit.txt`` to install them, like this::

    -e git+https://github.com/ic-labs/django-icekit@develop#egg=django-icekit[ ... ,import_xml]
JSON Analysis
=============

*(to be documented)*
XML Analysis
============

(Add ``import_xml`` to ``requirements-icekit.txt`` to install dependencies.)

``manage.py analyze_xml`` is a command-line tool that takes a path (default: ``./``) and writes to standard output a CSV analysis of every element in every XML file in the path. It requires the ``lxml`` library.

Usage examples::

    manage.py analyze_xml --help              # show help
    manage.py analyze_xml -l                  # list all XML files to be analyzed
    manage.py analyze_xml                     # analyze all XML files in the current path
    manage.py analyze_xml > analysis.csv      # analyze all XML files in the current path and write the results to a CSV file
    manage.py analyze_xml path/to/xml/        # analyze all XML files in the given path
    manage.py analyze_xml path/to/file.xml    # analyze a single XML file
    manage.py analyze_xml path/to/xml/ -r     # traverse the given path recursively
The analysis CSV contains these fields:

===================  ==================================================================
Column               Description
===================  ==================================================================
``path``             A dot-separated path to each XML tag.
``min_cardinality``  The minimum number of these elements that each of its parents has.
``max_cardinality``  The maximum number of these elements that each of its parents has.
``samples``          Non-repeating sample values of the text within the XML tag.
``attributes``       A list of all the attributes found for each tag.
===================  ==================================================================
Interpreting the analysis
-------------------------

path
~~~~

The path is dot-separated. A path that ``looks.like.this`` represents the ``<this>`` tag of a file structured like this::

    <looks>
        <like>
            <this></this>
        </like>
    </looks>
min/max_cardinality
~~~~~~~~~~~~~~~~~~~

``min_cardinality`` and ``max_cardinality`` tell you the minimum and maximum number of these elements you'll have to deal with each time you encounter them. If ``min_cardinality`` is 0, the element is optional. If ``max_cardinality`` is 1, the element is a singleton value. If ``max_cardinality`` is more than 1, the element is repeated to make up a list.
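As a sketch of how the cardinality figures could be derived (an illustrative stdlib reimplementation, not the actual ``lxml``-based command): each parent element's children are tallied per tag, and the minimum and maximum tallies, including zeroes for parents that lack the tag, become the cardinalities for that dotted path.

```python
# Illustrative sketch only -- NOT the actual analyze_xml implementation
# (which uses lxml). Computes min/max cardinality per dotted path.
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET


def cardinalities(xml_text):
    """Return {dotted.path: (min_cardinality, max_cardinality)}."""
    root = ET.fromstring(xml_text)
    # parent path -> one Counter of child tags per parent instance
    parents = defaultdict(list)

    def visit(elem, path):
        parents[path].append(Counter(child.tag for child in elem))
        for child in elem:
            visit(child, path + "." + child.tag)

    visit(root, root.tag)

    result = {}
    for path, tallies in parents.items():
        for tag in set().union(*(t.keys() for t in tallies)):
            counts = [t[tag] for t in tallies]  # 0 when a parent lacks the tag
            result[path + "." + tag] = (min(counts), max(counts))
    return result
```

With the ``<looks>`` example above extended so that one ``<like>`` has two ``<this>`` children and another has none, ``looks.like.this`` comes out as ``(0, 2)``: optional, and repeated to make up a list.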
samples
~~~~~~~

``samples`` is a particularly useful field. Apart from seeing the values to discern their likely data type, you can see the variety of values produced.

Set the number of samples to track with the ``-s``/``--samplesize`` option. The default value is 5; keep it above 3 so the range of values is easy to discern.

If you asked for 5 sample values but only got 1, the value is constant. If you got 2 values, there are only 2 distinct values in the entire collection (which means that the value could be boolean). If you got 0 values, the tag is always empty, or only ever contains children (see the next row of the CSV file to see if an element has any children).
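The "non-repeating" collection of samples can be pictured as keeping at most ``samplesize`` distinct values in first-seen order (an assumed model of the behaviour, not the actual implementation):

```python
# Illustrative sketch -- an assumed model of "non-repeating samples",
# not the actual implementation: keep up to sample_length distinct
# values, in the order they are first seen.
def take_samples(values, sample_length=5):
    samples = []
    for v in values:
        if v not in samples:
            samples.append(v)
            if len(samples) == sample_length:
                break
    return samples
```

Under this model a constant field yields one sample no matter how many records are scanned, and a boolean-ish field yields two.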
attributes
~~~~~~~~~~

This field lists all the attributes found for the tag, and a sample of their values.
MARC Analysis
=============

(Add ``import_marc`` to ``requirements-icekit.txt`` to install dependencies.)

``manage.py analyze_marc`` is a command-line tool that takes a path (default: ``./``) and writes to standard output a CSV analysis of every MARC file found in the path. It requires the ``pymarc`` library.

Usage examples::

    manage.py analyze_marc --help             # show help
    manage.py analyze_marc -l                 # list all MARC files to be analyzed
    manage.py analyze_marc                    # analyze all MARC files in the current path
    manage.py analyze_marc > analysis.csv     # analyze all MARC files in the current path and write the results to a CSV file
    manage.py analyze_marc path/to/marc/      # analyze all MARC files in the given path
    manage.py analyze_marc path/to/file.mrc   # analyze a single MARC file
    manage.py analyze_marc path/to/marc/ -r   # traverse the given path recursively

The analysis CSV has a row for each tag (with an empty subfield column), and a row for each subfield. Each row contains these fields:

===================  ================================================================
Column               Description
===================  ================================================================
``tag``              The 3-digit MARC tag.
``subfield``         The single-character subfield code.
``tag_meaning``      The English meaning of the tag/subfield, if known.
``record_count``     The number of records that have at least one of these tags.
``min_cardinality``  The minimum number of this tag or subfield that each record has.
``max_cardinality``  The maximum number of this tag or subfield that each record has.
``samples``          Non-repeating sample values of each tag or subfield.
===================  ================================================================
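The per-tag statistics can be sketched the same way as for XML (an illustrative stand-in in which each record is just a list of its tags; the real command parses records with ``pymarc``):

```python
# Illustrative sketch only -- the real analyze_marc command reads records
# with pymarc. Here each record is represented as a list of its MARC tags.
from collections import Counter


def tag_stats(records):
    """Return {tag: (record_count, min_cardinality, max_cardinality)}."""
    all_tags = set(tag for record in records for tag in record)
    stats = {}
    for tag in all_tags:
        counts = [Counter(record)[tag] for record in records]  # 0 if absent
        stats[tag] = (sum(1 for n in counts if n), min(counts), max(counts))
    return stats
```

A tag present twice in one record and absent from another would report a record count of 1, a minimum cardinality of 0 and a maximum of 2.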
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from optparse import make_option
from django.core.management import BaseCommand
from glamkit_collections.utils.files import getfiles


class AnalysisCommand(BaseCommand):
    help = "Prints a csv-formatted analysis of paths for all files found at the given paths."
    file_regex = r"\.xml$"

    option_list = BaseCommand.option_list + (
        make_option('-r', '--recursive',
            action='store_true',
            dest='recursive',
            default=False,
            help="traverse the given folder recursively"
        ),
        make_option("-l", "--list",
            action="store_true",
            dest="list_only",
            default=False,
            help="only list the files that would be analyzed"
        ),
        make_option("-s", "--samplesize",
            action="store",
            type="int",  # parse the value as an int, not a string
            dest="sample_length",
            default=5,
            help="provide this many samples of each element's text (default: 5)"
        ),
    )

    def analyze(self, paths, sample_length):
        raise NotImplementedError

    def handle(self, *args, **options):
        try:
            path = args[0]
        except IndexError:
            path = "./"

        paths = getfiles(path=path, regex=self.file_regex, recursive=options['recursive'])

        if options['list_only']:
            for p in paths:
                print p
        else:
            self.analyze(paths, sample_length=options['sample_length'])
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from . import AnalysisCommand
from glamkit_collections.utils.marc.analyze import marcanalyze


class Command(AnalysisCommand):
    help = "Prints a csv-formatted analysis of paths for all MARC files found at the given paths."
    file_regex = r"\.mrc$"

    def analyze(self, paths, sample_length):
        return marcanalyze(paths, sample_length)
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from glamkit_collections.utils.xml.lib.analyze import xmlanalyze
from . import AnalysisCommand


class Command(AnalysisCommand):
    help = "Prints a csv-formatted analysis of paths for all XML files found at the given paths."
    file_regex = r"\.xml$"

    def analyze(self, paths, sample_length):
        return xmlanalyze(paths, sample_length)
# Legacy imports. TODO: deprecate using here

from measurements import *
from slugs import *
from cleaning import *
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import itertools
import re


def ensure_unique(qs, field_name, value, exclude_id=None):
    """
    Makes sure that `value` is unique on model.fieldname, and non-empty.
    """
    orig = value
    if not value:
        value = "None"
    for x in itertools.count(1):
        if not qs.exclude(id=exclude_id).filter(**{field_name: value}).exists():
            break
        if orig:
            value = '%s-%d' % (orig, x)
        else:
            value = '%d' % x

    return value


def strip_parens(s):
    result = re.sub(r'^\(', '', s)
    result = re.sub(r'\)$', '', result)
    return result


def ndashify(s):
    """Replace ' - ' with an en-dash character."""
    return re.sub(r' - ', u'–', unicode(s))


def fix_line_breaks(s):
    r"""
    Convert \r\n and \r to \n chars. Strip any leading or trailing whitespace
    on each line. Remove blank lines.
    """
    lines = s.splitlines()
    stripped = [i.strip() for i in lines]
    stripped = [i for i in stripped if i]  # remove blank lines
    return "\n".join(stripped)


def strip_line_breaks(s):
    r"""
    Remove \r and \n chars, replacing with a space. Strip leading/trailing
    whitespace and collapse runs of whitespace into a single space.
    """
    return re.sub(r'[\r\n ]+', ' ', s).strip()


def remove_url_breaking_chars(s):
    """Remove characters that would break URLs (?, #, & and /)."""
    r = re.sub(r'[\?#&/]', '', s)
    return r.strip()
import os
import re


def getfiles(path, regex=r"", recursive=True, followlinks=True):
    """Generates the paths of files in the given folder that match a given regex."""

    rex = re.compile(regex)

    if os.path.isfile(path):
        p = os.path.abspath(path)
        if rex.search(p):
            yield p
    else:
        if recursive:
            # ``followlinks`` must be passed by keyword; os.walk's second
            # positional argument is ``topdown``, not ``followlinks``.
            for root, dirs, files in os.walk(path, followlinks=followlinks):
                for f in files:
                    p = os.path.abspath(os.path.join(root, f))
                    if rex.search(p):
                        yield p
        else:
            for f in os.listdir(path):
                p = os.path.abspath(os.path.join(path, f))
                if os.path.isfile(p):
                    if rex.search(p):
                        yield p