Re ic-labs#285 porting importtools libs and functionality into GLAMkit.
Greg Turner committed Aug 24, 2017
1 parent 1a17dec commit b62f3ce
Showing 31 changed files with 1,161 additions and 87 deletions.
3 changes: 2 additions & 1 deletion README.rst
Original file line number Diff line number Diff line change
@@ -72,7 +72,8 @@ instead::

$ bash <(curl -Ls https://raw.githubusercontent.com/ic-labs/django-icekit/develop/startproject.sh) {project_name} develop

and change the icekit branch in the generated :code:`requirements-icekit.txt` from :code:`@master` to :code:`@develop`.
and change the icekit branch in the generated :code:`requirements-icekit.txt` and :code:`Dockerfile` from
:code:`@master` to :code:`@develop`.

NOTE: Windows users should run this command in Git Bash, which comes
with `Git for Windows <https://git-for-windows.github.io/>`__.
Empty file added __init__.py
Empty file.
126 changes: 126 additions & 0 deletions docs/collections/analyzing_data.rst
@@ -0,0 +1,126 @@
GLAMkit comes with tools for analysing and harvesting from large collections of various file types, currently:

* JSON files
* XML files with unwieldy or undefined schemas (which may also be badly formed)
* MARC files which use an undocumented set of fields (and which may also be badly formed)

Requirements
------------
The data formats other than JSON rely on libraries that may not be installed by default. Alter the optional
``import_*`` extras in your project's ``requirements-icekit.txt`` to install them, like this::

-e git+https://github.com/ic-labs/django-icekit@develop#egg=django-icekit[ ... ,import_xml]


JSON Analysis
=============

*(to be documented)*

XML Analysis
============

(Add ``import_xml`` to ``requirements-icekit.txt`` to install dependencies)

``manage.py analyze_xml`` is a command-line tool that takes a path (defaulting to ``./``) and writes to standard
output a csv analysis of every element in every XML file in the path. It requires the ``lxml`` library.

Usage examples::

manage.py analyze_xml --help # show help
manage.py analyze_xml -l # list all xml files to be analyzed
manage.py analyze_xml # analyze all xml files in the current path
manage.py analyze_xml > analysis.csv # analyze all xml files in the current path and write the results to a csv file.
manage.py analyze_xml path/to/xml/ # analyze all xml files in the given path
manage.py analyze_xml path/to/file.xml # analyze a single xml file
manage.py analyze_xml path/to/xml/ -r # traverse the current path recursively

The analysis csv contains these fields:

===================  ==============================================================
Column               Description
===================  ==============================================================
``path``             A dot-separated path to each XML tag.
``min_cardinality``  The minimum number of these elements that each of its parents has.
``max_cardinality``  The maximum number of these elements that each of its parents has.
``samples``          Non-repeating sample values of the text within the XML tag.
``attributes``       A list of all the attributes found for each tag.
===================  ==============================================================


Interpreting the analysis
-------------------------

path
~~~~

The path is dot-separated. A path that ``looks.like.this`` represents the <this> tag of a file structured like this::

<looks>
<like>
<this></this>
</like>
</looks>

min/max_cardinality
~~~~~~~~~~~~~~~~~~~

``min_cardinality`` and ``max_cardinality`` tell you the minimum and maximum number of these elements you'll have
to deal with each time you encounter them. If ``min_cardinality`` is 0, the element is optional. If
``max_cardinality`` is 1, the element is a singleton value; if it is more than 1, the element is repeated to make
up a list.
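
For example, given files shaped like this (a hypothetical record structure)::

    <records>
      <record>
        <title>T</title>
      </record>
      <record>
        <title>T</title>
        <subject>S</subject>
        <subject>S</subject>
      </record>
    </records>

the path ``records.record.title`` would have a ``min_cardinality`` and ``max_cardinality`` of 1 (a required
singleton), while ``records.record.subject`` would have a ``min_cardinality`` of 0 and a ``max_cardinality``
of 2 (optional and repeatable).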

samples
~~~~~~~

``samples`` is a particularly useful field: the sample values suggest each element's likely data type, and also
show how widely the values vary.

Set the number of samples to track with the ``-s``/``--samplesize`` option. The default value is 5.

If you asked for 5 sample values but only got 1, the value is constant. If you got 2 values, there are only 2
distinct values in the entire collection (which means the value could be boolean). If you got 0 values, the tag is
always empty, or only ever contains children (check the following rows of the csv file for child elements).

Whatever sample size you choose, keep it above 3 so that the range of values is easy to discern.
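
The behaviour of non-repeating samples can be sketched in a few lines of Python. This is an illustration of the
idea only, not the actual implementation (the paths and values here are invented):

```python
from collections import defaultdict

def collect_samples(path_value_pairs, sample_size=5):
    # Keep up to `sample_size` distinct non-empty text values per element path.
    samples = defaultdict(list)
    for path, text in path_value_pairs:
        bucket = samples[path]  # a path whose text is always empty keeps an empty list
        if text and text not in bucket and len(bucket) < sample_size:
            bucket.append(text)
    return dict(samples)

observed = [
    ("record.format", "book"),
    ("record.format", "book"),   # repeated value: sampled once only
    ("record.format", "map"),
    ("record.title", "Blue Poles"),
    ("record.note", ""),         # always-empty tag: zero samples
]
result = collect_samples(observed)
print(result["record.format"])   # ['book', 'map']
print(result["record.note"])     # []
```

An empty sample list is how an always-empty tag shows up in the analysis.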

attributes
~~~~~~~~~~

This field lists all the attributes found for the tag, with a sample of their values.

MARC Analysis
=============

(Add ``import_marc`` to ``requirements-icekit.txt`` to install dependencies)


``manage.py analyze_marc`` is a command-line tool that takes a path (defaulting to ``./``) and writes to standard
output a csv analysis of every MARC file found in the path. It requires the ``pymarc`` library.

Usage examples::

manage.py analyze_marc --help # show help
manage.py analyze_marc -l # list all marc files to be analyzed
manage.py analyze_marc # analyze all MARC files in the current path
manage.py analyze_marc > analysis.csv # analyze all MARC files in the current path and write the results to a csv file.
manage.py analyze_marc path/to/marc/ # analyze all MARC files in the given path
manage.py analyze_marc path/to/file.mrc # analyze a single MARC file
manage.py analyze_marc path/to/marc/ -r # traverse the current path recursively

The analysis csv has a row for each tag (with an empty subfield column), and a row for each subfield. Each row contains
these fields:

===================  ==============================================================
Column               Description
===================  ==============================================================
``tag``              The 3-digit MARC tag.
``subfield``         The single-character subfield code.
``tag_meaning``      The English meaning of the tag/subfield, if known.
``record_count``     The number of records that have at least one of these tags.
``min_cardinality``  The minimum number of this tag or subfield that each record has.
``max_cardinality``  The maximum number of this tag or subfield that each record has.
``samples``          Non-repeating sample values of each tag or subfield.
===================  ==============================================================
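
Because the output is plain csv, it can be post-processed with the standard library. A sketch, assuming the
columns above appear as a csv header row (the rows below are invented for illustration):

```python
import csv
import io

# Invented rows in the documented column layout; tag meanings and counts
# are illustrative only.
analysis = """\
tag,subfield,tag_meaning,record_count,min_cardinality,max_cardinality,samples
245,,Title Statement,1000,1,1,
650,,Subject Added Entry,40,0,12,
650,a,Topical term,40,0,12,Painting
"""

def optional_tags(csv_text):
    # Tags whose min_cardinality is 0 are absent from at least one record.
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["tag"] for row in rows
            if row["subfield"] == "" and row["min_cardinality"] == "0"]

print(optional_tags(analysis))  # ['650']
```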
1 change: 1 addition & 0 deletions docs/index.rst
@@ -31,6 +31,7 @@ This documentation covers the technical usage and API of GLAMkit.

architecture/index
topics/*
collections/*
reference/*
contributing/index
changelog
49 changes: 49 additions & 0 deletions glamkit_collections/management/commands/__init__.py
@@ -0,0 +1,49 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from optparse import make_option
from django.core.management import BaseCommand
from glamkit_collections.utils.files import getfiles


class AnalysisCommand(BaseCommand):
help = "Prints a csv-formatted analysis of paths for all files found at the given paths."
file_regex = r"\.xml$"

option_list = BaseCommand.option_list + (
make_option('-r', '--recursive',
action='store_true',
dest='recursive',
default=False,
help="traverse the given folder recursively"
),
make_option("-l", "--list",
action="store_true",
dest="list_only",
default=False,
help="only list the files that would be analyzed"
),
        make_option("-s", "--samplesize",
            action="store",
            type="int",  # otherwise optparse stores the value as a string
            dest="sample_length",
            default=5,
            help="provide this many samples of each element's text (default: 5)"
        ),
)

def analyze(self, paths, sample_length):
raise NotImplementedError

def handle(self, *args, **options):
try:
path = args[0]
except IndexError:
path = "./"

paths = getfiles(path=path, regex=self.file_regex, recursive=options['recursive'])

if options['list_only']:
for p in paths:
                print(p)
else:
self.analyze(paths, sample_length=options['sample_length'])
14 changes: 14 additions & 0 deletions glamkit_collections/management/commands/analyze_marc.py
@@ -0,0 +1,14 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from . import AnalysisCommand
from glamkit_collections.utils.marc.analyze import marcanalyze


class Command(AnalysisCommand):
    help = "Prints a csv-formatted analysis of paths for all MARC files found at the given paths."
file_regex = r"\.mrc$"

def analyze(self, paths, sample_length):
return marcanalyze(paths, sample_length)

14 changes: 14 additions & 0 deletions glamkit_collections/management/commands/analyze_xml.py
@@ -0,0 +1,14 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from glamkit_collections.utils.xml.lib.analyze import xmlanalyze
from . import AnalysisCommand


class Command(AnalysisCommand):
    help = "Prints a csv-formatted analysis of paths for all XML files found at the given paths."
file_regex = r"\.xml$"

def analyze(self, paths, sample_length):
return xmlanalyze(paths, sample_length)

7 changes: 7 additions & 0 deletions glamkit_collections/utils/__init__.py
@@ -0,0 +1,7 @@
# Legacy imports. TODO: deprecate using here

from .measurements import *
from .slugs import *
from .cleaning import *


59 changes: 59 additions & 0 deletions glamkit_collections/utils/cleaning.py
@@ -0,0 +1,59 @@
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re

import itertools


def ensure_unique(qs, field_name, value, exclude_id=None):
"""
    Return a variant of `value` that is unique on `field_name` within `qs`
    (optionally excluding `exclude_id` from the check), and non-empty.
"""
orig = value
if not value:
value = "None"
for x in itertools.count(1):
if not qs.exclude(id=exclude_id).filter(**{field_name: value}).exists():
break
if orig:
value = '%s-%d' % (orig, x)
else:
value = '%d' % x

return value
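
The suffixing behaviour of ``ensure_unique`` can be illustrated in isolation, with a plain set standing in for the
queryset lookup (a sketch, not part of the module):

```python
import itertools

def ensure_unique_in(existing, value):
    # The same loop as ensure_unique above, but checking membership in a
    # plain set of existing values instead of querying a Django queryset.
    orig = value
    if not value:
        value = "None"
    for x in itertools.count(1):
        if value not in existing:
            break
        value = "%s-%d" % (orig, x) if orig else "%d" % x
    return value

print(ensure_unique_in({"home", "home-1"}, "home"))  # home-2
print(ensure_unique_in(set(), ""))                   # None
```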


def strip_parens(s):
result = re.sub(r'^\(', '', s)
result = re.sub(r'\)$', '', result)
return result


def ndashify(s):
"""replace ' - ' with an n-dash character"""
return re.sub(r' - ', u'–', unicode(s))


def fix_line_breaks(s):
    r"""
    Convert \r\n and \r to \n chars, strip leading and trailing whitespace
    on each line, and remove blank lines.
    """
    lines = [line.strip() for line in s.splitlines()]
    return "\n".join(line for line in lines if line)


def strip_line_breaks(s):
    r"""
    Collapse runs of \r, \n and space chars into a single space, and strip
    leading/trailing whitespace.
    """
    return re.sub(r'[\r\n ]+', ' ', s).strip()


def remove_url_breaking_chars(s):
r = re.sub(r'[\?#&/]', '', s)
return r.strip()
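
The two line-break helpers can be sanity-checked with a Python 3 restatement (the module above targets Python 2,
so this is a standalone sketch rather than a direct import):

```python
import re

def fix_line_breaks(s):
    # Normalise \r\n and \r to \n, strip each line, drop blank lines.
    lines = [line.strip() for line in s.splitlines()]
    return "\n".join(line for line in lines if line)

def strip_line_breaks(s):
    # Collapse runs of \r, \n and spaces into a single space.
    return re.sub(r"[\r\n ]+", " ", s).strip()

print(fix_line_breaks("  one \r\ntwo\r\r\n\r\n three "))  # one / two / three on separate lines
print(strip_line_breaks("one\r\n  two\nthree"))           # one two three
```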

25 changes: 25 additions & 0 deletions glamkit_collections/utils/files.py
@@ -0,0 +1,25 @@
import os
import re

def getfiles(path, regex=r"", recursive=True, followlinks=True):
    """Yield absolute paths of files under ``path`` whose paths match ``regex``."""

rex = re.compile(regex)

if os.path.isfile(path):
p = os.path.abspath(path)
if rex.search(p):
yield p
else:
if recursive:
            for root, dirs, files in os.walk(path, followlinks=followlinks):
for f in files:
p = os.path.abspath(os.path.join(root, f))
if rex.search(p):
yield p
else:
for f in os.listdir(path):
p = os.path.abspath(os.path.join(path, f))
if os.path.isfile(p):
if rex.search(p):
yield p
