What is data? What counts as data? These are questions we will explore throughout the workshop.
Data is foundational to nearly all digital projects and often helps us understand and express our ideas and narratives. Hence, to do digital work, we should know how data is captured, constructed, and manipulated. In this workshop, we will discuss the basics of research data in terms of its material, transformation, and presentation. We will also engage with the ethical dimensions of working with data, from collection to visualization to representation.
By the end of this workshop, participants will:
- Know the stages of data analysis
- Understand the difference between proprietary and open data formats
- Become familiar with the specific requirements of "high quality data"
- Learn about ethical issues around working with different types of data and analysis
This workshop is estimated to take you 3–4 hours to complete.
- Data is Foundational
- Stages of Data
- Stages of Data: Raw
- Stages of Data: Processed/Transformed
- Side Note on Data Structures: Tidy Data
- More Stages of Data: Cleaned
- More Stages of Data: Analyzed
- More Stages of Data: Visualized
- Data Literacy and Ethics
- Some Concluding Thoughts
If you do not have experience with or basic knowledge of the following, you may want to look into them before you start Data Literacies:
- Introduction to the Command Line (required): This workshop refers to concepts from the Command Line workshop, and knowing how to use the command line is central for anyone who wants to learn how to handle, process, and analyze data.
- Download the workshop dataset (required): The dataset, moSmall.csv, will be used throughout the challenges in the workshop. To save the file to your local computer, right-click the "Download the workshop dataset" link and choose "Save Link As...". Note: It is important to make sure your file is saved as a .csv file. Original dataset taken from The Metropolitan Museum of Art's Creative Commons Zero.
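If you want to confirm that the download really was saved as a CSV file, a quick check like the following may help. This is only a minimal sketch: the helper name `summarize_csv` is hypothetical, it assumes the file is UTF-8 encoded, and it assumes `moSmall.csv` sits in the folder you run the script from.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, data_row_count) if every record matches the header width.

    Hypothetical helper for illustration; it is not part of the workshop materials.
    """
    # "utf-8-sig" reads plain UTF-8 and also strips a byte-order mark if one is present.
    with Path(path).open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            raise ValueError("file is empty")
        row_count = 0
        for record_number, row in enumerate(reader, start=2):
            if not row:
                continue  # skip completely blank lines
            if len(row) != len(header):
                raise ValueError(
                    f"record {record_number} has {len(row)} fields, "
                    f"expected {len(header)}"
                )
            row_count += 1
    return header, row_count


if __name__ == "__main__":
    dataset = Path("moSmall.csv")  # assumed location: the current working directory
    if dataset.exists():
        columns, rows = summarize_csv(dataset)
        print(f"{len(columns)} columns, {rows} data rows")
    else:
        print("moSmall.csv was not found in this folder")
```

If the file was accidentally saved as a web page or with another format, the column check will most likely fail with a `ValueError`, which is a hint to download it again.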
Before you start the Data Literacies workshop, we want to remind you of some ethical considerations to take into account when you read through the lessons of this workshop:
- Data and data analysis are not free from bias. There is no magic black box from which data emerges; data is contextually driven. As we think about automating the analysis of "big" data, we have to be aware of the "hidden" biases that get reproduced.
- De-identified information can be reconstructed from piecemeal data found across different sources. When we consider what we are doing with the data we have collected, we also need to think about the possible re-identification of our participants.
- Consider how you may use differential privacy as a strategy against re-identification. The 2020 US Census offers an example of using this strategy to address privacy concerns.
- Big data projects often require sharing datasets across different individuals and teams. In addition, to ensure that our work is reproducible and accountable, we may also feel inclined to share the data collected. As such, figuring out how to share such data is crucial in the project planning stage.
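To make the idea of differential privacy mentioned above a little more concrete, here is a minimal sketch of one textbook approach, the Laplace mechanism applied to a simple count. It is an illustration under stated assumptions, not the Census Bureau's actual system, which is far more elaborate. The function names `laplace_noise` and `private_count` and the example numbers are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng=None):
    """Return a noisy version of a count using the Laplace mechanism.

    Assumption: adding or removing one person changes the count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so the demo is repeatable
    # Hypothetical example: 120 participants share some attribute.
    for epsilon in (0.1, 1.0, 10.0):
        print(epsilon, round(private_count(120, epsilon, rng), 2))
```

Publishing the noisy count instead of the exact one makes it harder to infer whether any single participant is in the data, at the cost of some accuracy. Choosing `epsilon` is a judgment call about that trade-off.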
Before you start the Data Literacies workshop, you may want to read a couple of our pre-reading suggestions:
- In Big? Smart? Clean? Messy? Data in the Humanities, Christof Schöch discusses what data means in the humanities and the necessity of "smart big data."
- The book Bit by Bit: Social Research in the Digital Age, written by Matthew Salganik, approaches data and social research from a computational social science perspective. Salganik also discusses the idea of "readymade" and "custommade" data alongside ethics.
- Ten Simple Rules for Responsible Big Data Research explores some guidelines for addressing complex ethical issues that arise in any research project.
You may also want to check out a couple of projects that use the skills discussed in this workshop:
- Data for Public Good is a semester-long collaborative project led by CUNY graduate students. Each semester, a different public-interest dataset is explored to present information that is useful and informative to a public audience.
- SAFElab, led by Dr. Desmond U. Patton, uses computational and social work approaches to understand the mechanisms of violence and to work on prevention of and intervention in violence that occurs in neighborhoods and on social media.
This workshop is the result of a collaborative effort by a team of people, most of whom are currently or were previously involved with the Graduate Center's Digital Initiatives. If you want to see statistics for contributions to this workshop, you can do so here. This is a list of all the contributors:
- Current author: Di Yoong
- Past contributing author: Stephen Zweibel
- Past reviewer: Stefano Morello
- Past reviewer: Filipa Calado
- Current editor: Lisa Rhody
- Current editor: Kalle Westerling
Digital Research Institute (DRI) Curriculum by Graduate Center Digital Initiatives is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/DHRI-Curriculum. When sharing this material or derivative works, preserve this paragraph, changing only the title of the derivative work, or provide comparable attribution.