Re ic-labs#285 porting importtools libs and functionality into GLAMkit.
Greg Turner committed Aug 24, 2017
1 parent 1a17dec commit b62f3ce
Showing 31 changed files with 1,161 additions and 87 deletions.
3 changes: 2 additions & 1 deletion README.rst
Original file line number Diff line number Diff line change
@@ -72,7 +72,8 @@ instead::

$ bash <(curl -Ls https://raw.githubusercontent.com/ic-labs/django-icekit/develop/startproject.sh) {project_name} develop

and change the icekit branch in the generated :code:`requirements-icekit.txt` from :code:`@master` to :code:`@develop`.
and change the icekit branch in the generated :code:`requirements-icekit.txt` and :code:`Dockerfile` from
:code:`@master` to :code:`@develop`.

NOTE: Windows users should run this command in Git Bash, which comes
with `Git for Windows <https://git-for-windows.github.io/>`__.
Empty file added __init__.py
Empty file.
126 changes: 126 additions & 0 deletions docs/collections/analyzing_data.rst
@@ -0,0 +1,126 @@
GLAMkit comes with tools for analysing and harvesting from large collections of various file types, currently:

* JSON files
* XML files with unwieldy or undefined schemas (which may also be badly formed)
* MARC files which use an undocumented set of fields (and which may also be badly formed)

Requirements
------------
The data formats other than JSON rely on libraries that may not be installed by default. Alter the optional
``import_*`` extras in your project's ``requirements-icekit.txt`` to install them, like this::

-e git+https://github.com/ic-labs/django-icekit@develop#egg=django-icekit[ ... ,import_xml]


JSON Analysis
=============

*(to be documented)*

XML Analysis
============

(Add ``import_xml`` to ``requirements-icekit.txt`` to install dependencies)

``manage.py analyze_xml`` is a command-line tool that takes a path (defaulting to ``./``) and writes to standard
output a csv analysis of every element in every XML file in the path. It requires the ``lxml`` library.

Usage examples::

manage.py analyze_xml --help # show help
manage.py analyze_xml -l # list all xml files to be analyzed
manage.py analyze_xml # analyze all xml files in the current path
manage.py analyze_xml > analysis.csv # analyze all xml files in the current path and write the results to a csv file.
manage.py analyze_xml path/to/xml/ # analyze all xml files in the given path
manage.py analyze_xml path/to/file.xml # analyze a single xml file
manage.py analyze_xml path/to/xml/ -r # traverse the current path recursively

The analysis csv contains these fields:

===================  ==============================================================
Column               Description
===================  ==============================================================
``path``             A dot-separated path to each XML tag.
``min_cardinality``  The minimum number of these elements that each of its parents has.
``max_cardinality``  The maximum number of these elements that each of its parents has.
``samples``          Non-repeating sample values of the text within the XML tag.
``attributes``       A list of all the attributes found for each tag.
===================  ==============================================================


Interpreting the analysis
-------------------------

path
~~~~

The path is dot-separated. A path that ``looks.like.this`` represents the <this> tag of a file structured like this::

<looks>
<like>
<this></this>
</like>
</looks>

min/max_cardinality
~~~~~~~~~~~~~~~~~~~

``min_cardinality`` and ``max_cardinality`` tell you the minimum and maximum number of these elements you'll have
to deal with each time you encounter them. If ``min_cardinality`` is 0, the element is optional. If
``max_cardinality`` is 1, the element is a singleton value; if it is more than 1, the element is repeated to make
up a list.
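
For example, given files shaped like this (a hypothetical record structure)::

    <records>
      <record>
        <title>T</title>
      </record>
      <record>
        <title>T</title>
        <subject>S</subject>
        <subject>S</subject>
      </record>
    </records>

the path ``records.record.title`` would have a ``min_cardinality`` and ``max_cardinality`` of 1 (a required
singleton), while ``records.record.subject`` would have a ``min_cardinality`` of 0 and a ``max_cardinality``
of 2 (optional and repeatable).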

samples
~~~~~~~

``samples`` is a particularly useful field: the sample values suggest each element's likely data type, and also
show how widely the values vary.

Set the number of samples to track with the ``-s``/``--samplesize`` option. The default value is 5.

If you asked for 5 sample values but only got 1, the value is constant. If you got 2 values, there are only 2
distinct values in the entire collection (which means the value could be boolean). If you got 0 values, the tag is
always empty, or only ever contains children (check the following rows of the csv file for child elements).

Whatever sample size you choose, keep it above 3 so that the range of values is easy to discern.
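
The behaviour of non-repeating samples can be sketched in a few lines of Python. This is an illustration of the
idea only, not the actual implementation (the paths and values here are invented):

```python
from collections import defaultdict

def collect_samples(path_value_pairs, sample_size=5):
    # Keep up to `sample_size` distinct non-empty text values per element path.
    samples = defaultdict(list)
    for path, text in path_value_pairs:
        bucket = samples[path]  # a path whose text is always empty keeps an empty list
        if text and text not in bucket and len(bucket) < sample_size:
            bucket.append(text)
    return dict(samples)

observed = [
    ("record.format", "book"),
    ("record.format", "book"),   # repeated value: sampled once only
    ("record.format", "map"),
    ("record.title", "Blue Poles"),
    ("record.note", ""),         # always-empty tag: zero samples
]
result = collect_samples(observed)
print(result["record.format"])   # ['book', 'map']
print(result["record.note"])     # []
```

An empty sample list is how an always-empty tag shows up in the analysis.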

attributes
~~~~~~~~~~

This field lists all the attributes found for the tag, with a sample of their values.

MARC Analysis
=============

(Add ``import_marc`` to ``requirements-icekit.txt`` to install dependencies)


``manage.py analyze_marc`` is a command-line tool that takes a path (defaulting to ``./``) and writes to standard
output a csv analysis of every MARC file found in the path. It requires the ``pymarc`` library.

Usage examples::

manage.py analyze_marc --help # show help
manage.py analyze_marc -l # list all marc files to be analyzed
manage.py analyze_marc # analyze all MARC files in the current path
manage.py analyze_marc > analysis.csv # analyze all MARC files in the current path and write the results to a csv file.
manage.py analyze_marc path/to/marc/ # analyze all MARC files in the given path
manage.py analyze_marc path/to/file.mrc # analyze a single MARC file
manage.py analyze_marc path/to/marc/ -r # traverse the current path recursively

The analysis csv has a row for each tag (with an empty subfield column), and a row for each subfield. Each row contains
these fields:

===================  ==============================================================
Column               Description
===================  ==============================================================
``tag``              The 3-digit MARC tag.
``subfield``         The single-character subfield code.
``tag_meaning``      The English meaning of the tag/subfield, if known.
``record_count``     The number of records that have at least one of these tags.
``min_cardinality``  The minimum number of this tag or subfield that each record has.
``max_cardinality``  The maximum number of this tag or subfield that each record has.
``samples``          Non-repeating sample values of each tag or subfield.
===================  ==============================================================
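
Because the output is plain csv, it can be post-processed with the standard library. A sketch, assuming the
columns above appear as a csv header row (the rows below are invented for illustration):

```python
import csv
import io

# Invented rows in the documented column layout; tag meanings and counts
# are illustrative only.
analysis = """\
tag,subfield,tag_meaning,record_count,min_cardinality,max_cardinality,samples
245,,Title Statement,1000,1,1,
650,,Subject Added Entry,40,0,12,
650,a,Topical term,40,0,12,Painting
"""

def optional_tags(csv_text):
    # Tags whose min_cardinality is 0 are absent from at least one record.
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["tag"] for row in rows
            if row["subfield"] == "" and row["min_cardinality"] == "0"]

print(optional_tags(analysis))  # ['650']
```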
1 change: 1 addition & 0 deletions docs/index.rst
@@ -31,6 +31,7 @@ This documentation covers the technical usage and API of GLAMkit.

architecture/index
topics/*
collections/*
reference/*
contributing/index
changelog
49 changes: 49 additions & 0 deletions glamkit_collections/management/commands/__init__.py
@@ -0,0 +1,49 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from optparse import make_option
from django.core.management import BaseCommand
from glamkit_collections.utils.files import getfiles


class AnalysisCommand(BaseCommand):
help = "Prints a csv-formatted analysis of paths for all files found at the given paths."
file_regex = r"\.xml$"

option_list = BaseCommand.option_list + (
make_option('-r', '--recursive',
action='store_true',
dest='recursive',
default=False,
help="traverse the given folder recursively"
),
make_option("-l", "--list",
action="store_true",
dest="list_only",
default=False,
help="only list the files that would be analyzed"
),
        make_option("-s", "--samplesize",
            action="store",
            type="int",  # otherwise optparse stores the value as a string
            dest="sample_length",
            default=5,
            help="provide this many samples of each element's text (default: 5)"
        ),
)

def analyze(self, paths, sample_length):
raise NotImplementedError

def handle(self, *args, **options):
try:
path = args[0]
except IndexError:
path = "./"

paths = getfiles(path=path, regex=self.file_regex, recursive=options['recursive'])

if options['list_only']:
for p in paths:
                print(p)
else:
self.analyze(paths, sample_length=options['sample_length'])
14 changes: 14 additions & 0 deletions glamkit_collections/management/commands/analyze_marc.py
@@ -0,0 +1,14 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from . import AnalysisCommand
from glamkit_collections.utils.marc.analyze import marcanalyze


class Command(AnalysisCommand):
    help = "Prints a csv-formatted analysis of paths for all MARC files found at the given paths."
file_regex = r"\.mrc$"

def analyze(self, paths, sample_length):
return marcanalyze(paths, sample_length)

14 changes: 14 additions & 0 deletions glamkit_collections/management/commands/analyze_xml.py
@@ -0,0 +1,14 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from glamkit_collections.utils.xml.lib.analyze import xmlanalyze
from . import AnalysisCommand


class Command(AnalysisCommand):
    help = "Prints a csv-formatted analysis of paths for all XML files found at the given paths."
file_regex = r"\.xml$"

def analyze(self, paths, sample_length):
return xmlanalyze(paths, sample_length)

7 changes: 7 additions & 0 deletions glamkit_collections/utils/__init__.py
@@ -0,0 +1,7 @@
# Legacy imports. TODO: deprecate using here

from .measurements import *
from .slugs import *
from .cleaning import *


59 changes: 59 additions & 0 deletions glamkit_collections/utils/cleaning.py
@@ -0,0 +1,59 @@
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re

import itertools


def ensure_unique(qs, field_name, value, exclude_id=None):
"""
    Return a variant of `value` that is unique on `field_name` within `qs`
    (optionally excluding `exclude_id` from the check), and non-empty.
"""
orig = value
if not value:
value = "None"
for x in itertools.count(1):
if not qs.exclude(id=exclude_id).filter(**{field_name: value}).exists():
break
if orig:
value = '%s-%d' % (orig, x)
else:
value = '%d' % x

return value
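
The suffixing behaviour of ``ensure_unique`` can be illustrated in isolation, with a plain set standing in for the
queryset lookup (a sketch, not part of the module):

```python
import itertools

def ensure_unique_in(existing, value):
    # The same loop as ensure_unique above, but checking membership in a
    # plain set of existing values instead of querying a Django queryset.
    orig = value
    if not value:
        value = "None"
    for x in itertools.count(1):
        if value not in existing:
            break
        value = "%s-%d" % (orig, x) if orig else "%d" % x
    return value

print(ensure_unique_in({"home", "home-1"}, "home"))  # home-2
print(ensure_unique_in(set(), ""))                   # None
```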


def strip_parens(s):
result = re.sub(r'^\(', '', s)
result = re.sub(r'\)$', '', result)
return result


def ndashify(s):
"""replace ' - ' with an n-dash character"""
return re.sub(r' - ', u'–', unicode(s))


def fix_line_breaks(s):
    r"""
    Convert \r\n and \r to \n chars, strip leading and trailing whitespace
    on each line, and remove blank lines.
    """
    lines = [line.strip() for line in s.splitlines()]
    return "\n".join(line for line in lines if line)


def strip_line_breaks(s):
    r"""
    Collapse runs of \r, \n and space chars into a single space, and strip
    leading/trailing whitespace.
    """
    return re.sub(r'[\r\n ]+', ' ', s).strip()


def remove_url_breaking_chars(s):
r = re.sub(r'[\?#&/]', '', s)
return r.strip()
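
The two line-break helpers can be sanity-checked with a Python 3 restatement (the module above targets Python 2,
so this is a standalone sketch rather than a direct import):

```python
import re

def fix_line_breaks(s):
    # Normalise \r\n and \r to \n, strip each line, drop blank lines.
    lines = [line.strip() for line in s.splitlines()]
    return "\n".join(line for line in lines if line)

def strip_line_breaks(s):
    # Collapse runs of \r, \n and spaces into a single space.
    return re.sub(r"[\r\n ]+", " ", s).strip()

print(fix_line_breaks("  one \r\ntwo\r\r\n\r\n three "))  # one / two / three on separate lines
print(strip_line_breaks("one\r\n  two\nthree"))           # one two three
```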

25 changes: 25 additions & 0 deletions glamkit_collections/utils/files.py
@@ -0,0 +1,25 @@
import os
import re

def getfiles(path, regex=r"", recursive=True, followlinks=True):
    """Yield absolute paths of files under ``path`` whose paths match ``regex``."""

rex = re.compile(regex)

if os.path.isfile(path):
p = os.path.abspath(path)
if rex.search(p):
yield p
else:
if recursive:
            for root, dirs, files in os.walk(path, followlinks=followlinks):
for f in files:
p = os.path.abspath(os.path.join(root, f))
if rex.search(p):
yield p
else:
for f in os.listdir(path):
p = os.path.abspath(os.path.join(path, f))
if os.path.isfile(p):
if rex.search(p):
yield p
