Foundations of tidy data for building data skills
A non-technical guide for students in the applied social sciences
Note: This is a work in progress, written as part of my Data Visualization course at the University of Michigan’s School of Social Work. Feel free to subscribe if you are interested in articles on developing foundational data skills.
In the applied social sciences, such as social work and public health, data has become increasingly important. Collecting data is not enough, however; how it is organized determines how usable and effective it is. This is where the concept of “tidy data” comes into play.
At its core, “tidy data,” as articulated by Hadley Wickham, is about structuring datasets to facilitate easy analysis. Imagine trying to find a particular book in a vast library. If the books are haphazardly placed, your task becomes monumentally challenging. However, the task becomes significantly easier if they are organized systematically by genre or author. Similarly, tidy data principles streamline data analysis by ensuring data is organized effectively. Tidy data rests on three principles:
- Columns: Every column in a dataset should represent a single variable. In this context, a variable refers to any measured attribute or characteristic.
- Rows: Each row should capture one observation, a set of measurements taken under defined conditions.
- Observational units: Each type of observational unit (the primary entity or object being studied) should have its own table. Combining multiple observational units in one table muddles the clarity and precision of the data. Knowing your observational units is critical to structuring your data effectively.
Understanding the significance of tidy data becomes clearer with an example from the social work sector. Consider a scenario where a social worker is assessing the well-being of families and individual children within those families in a community.
Principles of Observational Units
In data organization, particularly in social work, understanding the concept of observational units is crucial. An observational unit is the primary entity or object upon which measurements or observations are made. In essence, it’s what your data is “about.” Recognizing and correctly structuring these units is pivotal, as it can significantly affect your data’s quality, clarity, and usefulness.
Why Are Observational Units Important?
- Precision in Representation: Properly identifying observational units ensures each entity is represented accurately without redundancy.
- Efficient Data Analysis: Structured data based on observational units is more straightforward to manipulate, analyze, and interpret.
- Consistency and Clarity: Data becomes easier to understand, leading to fewer errors and misinterpretations.
- Flexibility for Expansion: As research or projects evolve, structured data allows for more straightforward integration of new information.
A Social Work Example
Imagine a social worker is involved in a project to evaluate the well-being of families and children in a particular community. In this case, there are two primary observational units:
- Families: Representing the collective entity of related individuals living together.
- Children: Representing individual young members within these families.
Attempting to capture both families and children in one table leads to:
Notice the repetition? The data about each family (like address, income, and caregiver) is repeated for every child in that family. Applying the principles of tidy data, we would instead have two separate tables, one for families and one for children, linked by `Family_ID`.
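For readers who want to see this split done in code, here is a minimal sketch in plain Python. The rows are invented for illustration, and every column name other than `Family_ID` is an assumption rather than something taken from the course data.

```python
# Hypothetical combined table: one row per child, family details repeated.
combined = [
    {"Family_ID": "F1", "Address": "12 Oak St", "Income": 42000,
     "Caregiver": "Maria", "Child_ID": "C1", "Child_Age": 7},
    {"Family_ID": "F1", "Address": "12 Oak St", "Income": 42000,
     "Caregiver": "Maria", "Child_ID": "C2", "Child_Age": 10},
    {"Family_ID": "F2", "Address": "8 Elm Ave", "Income": 35000,
     "Caregiver": "James", "Child_ID": "C3", "Child_Age": 4},
]

# Assumed column assignments for each observational unit.
FAMILY_COLUMNS = ["Family_ID", "Address", "Income", "Caregiver"]
CHILD_COLUMNS = ["Child_ID", "Family_ID", "Child_Age"]

# Families table: one row per family. Keying on Family_ID collapses
# the repeated family details into a single row.
families_by_id = {}
for row in combined:
    families_by_id[row["Family_ID"]] = {col: row[col] for col in FAMILY_COLUMNS}
families = list(families_by_id.values())

# Children table: one row per child. Family_ID is kept as the link
# back to the families table.
children = [{col: row[col] for col in CHILD_COLUMNS} for row in combined]

print(len(families), "families;", len(children), "children")
```

Each family's address, income, and caregiver now appear once, so a correction only has to be made in one place.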
As another example, consider services provided over time. Take a moment and think about how you might restructure this data.
In this example, the dates are spread across multiple columns. This structure makes the data harder to analyze, update, and maintain. The goal of tidy data principles is to restructure such data for clarity, efficiency, and ease of use.
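One possible restructuring, sketched in plain Python: turn each date column into its own row, so that "date" becomes a single variable. The layout below (a `Client_ID` column plus one column per service date) and all of its values are assumptions made up for this sketch.

```python
# Hypothetical "wide" table: one column per service date.
wide = [
    {"Client_ID": "A01", "2023-01-05": "Counseling", "2023-01-12": "Housing referral"},
    {"Client_ID": "A02", "2023-01-05": "", "2023-01-12": "Counseling"},
]

# "Long" (tidy) table: one row per client per service date.
long_rows = []
for row in wide:
    for column, value in row.items():
        if column == "Client_ID":
            continue  # identifier column, not a date
        if value == "":
            continue  # no service on that date, so no observation to record
        long_rows.append(
            {"Client_ID": row["Client_ID"], "Date": column, "Service": value}
        )

for r in long_rows:
    print(r)
```

In the long format, adding a new service date means adding rows rather than adding a column, so the table's structure does not change as services continue.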
Effectively preparing data tables and spreadsheets
Getting the data into the right structure is critical to effective data analysis. Once you have established the correct structure, consider some essential principles for preparing tables and spreadsheets.
1. Naming Files:
- Descriptive Names: Ensure file names represent the content of the spreadsheet.
- Date Formatting: If you include a date in the filename, adopt a consistent format like `YYYY-MM-DD` so that files are chronologically sorted.
- Version Control: Denote file versions using a version number (e.g., `v1`, `v2`). Also, keep track of every change by maintaining a log or using a version control system.
- Avoid Special Characters: Refrain from using characters like \ / : * ? <> | in file names, as these can cause issues in file paths.
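The file-naming guidelines above can be sketched as a small helper. The function name `build_filename` and the pattern `date_description_version` are assumptions chosen for this illustration; other consistent patterns work just as well.

```python
from datetime import date

# Characters the guidelines above recommend avoiding in file names.
FORBIDDEN = set('\\/:*?<>|')


def build_filename(description, file_date, version, extension="csv"):
    """Return a name like 2024-03-05_description_v1.csv (hypothetical pattern)."""
    # isoformat() always yields YYYY-MM-DD, so names sort chronologically.
    name = f"{file_date.isoformat()}_{description}_v{version}.{extension}"
    bad = sorted(FORBIDDEN.intersection(name))
    if bad:
        raise ValueError(f"Avoid these characters in file names: {bad}")
    return name


older = build_filename("family-wellbeing-survey", date(2024, 3, 5), 1)
newer = build_filename("family-wellbeing-survey", date(2024, 11, 20), 2)

# Sorting alphabetically also sorts by date because the date comes first.
print(sorted([newer, older]))
```

Putting the date at the start of the name is a design choice: it is what makes an ordinary alphabetical sort in a file browser match chronological order.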
2. Naming Variables (Columns):
- First Row for Variable Names: The first row should exclusively contain variable names. This ensures clarity and consistency across the spreadsheet.
- Consistency in Naming: Maintain a uniform naming convention. Whether you choose camelCase or underscores, be consistent throughout.
- Descriptive: Names should offer immediate insight into the data they represent. They should be short but intuitive, capturing exactly what is contained.
- No Spaces or Special Characters: Avoid spaces and characters like `#`, `%`, `&` in variable names, as these can interfere with formulas and scripts.
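As a rough illustration of these naming rules, the sketch below converts messy column headers into lowercase names joined by underscores. The helper `clean_name` and the example headers are hypothetical, and the underscore style is just one of the consistent conventions mentioned above.

```python
import re


def clean_name(raw):
    """Lowercase a header and replace spaces/special characters with underscores."""
    name = raw.strip().lower()
    # Any run of characters other than letters and digits becomes one underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    # Drop underscores left over at the start or end.
    return name.strip("_")


headers = ["Family ID", "Income ($)", "% Attendance", "Caregiver Name"]
cleaned = [clean_name(h) for h in headers]
print(cleaned)
```

An automatic cleanup like this only removes problem characters. It cannot make a vague name descriptive, so a unit that was carried by a symbol (such as `%`) may need to be added back by hand.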
3. Spreadsheet Formatting:
- Clean Header Row: The topmost row (header) is reserved **only** for variable names. It should not contain any data values.
- Remove Formatting: Do not use highlights, bold typeface, titles, or other decorative formatting in the header or data rows. This keeps the data as the primary focus and reduces confusion.
- Consistent Data Format: All values within a column should share the same data type (for example, all dates or all numbers).
- No Merged Cells: Avoid merging cells, as they can introduce complications during data analysis and when importing/exporting data.
4. Data Management:
- Preserve Original Data: Always retain the original data untouched, especially during cleaning or preparation for analysis.
- Avoid Blank Cells: Empty cells can disrupt analysis, because a blank does not say whether a value is missing or simply was not entered. Instead of leaving cells blank, use a consistent marker like `NA` to denote missing data, and make sure the marker cannot be mistaken for a real value.
- Data Validation: Implement data validation to ensure consistent and accurate data entry, restricting entries to specific types or formats.
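To make the last two points concrete, here is a minimal sketch that fills blank cells with `NA` and then flags entries that fail a simple validation rule. The rows, the column names, and the 0 to 17 age range are all assumptions invented for this example.

```python
MISSING = "NA"  # one consistent marker for missing values

# Hypothetical raw entries as they might come out of a spreadsheet.
rows = [
    {"child_id": "C1", "age": "7"},
    {"child_id": "C2", "age": ""},       # blank cell
    {"child_id": "C3", "age": "seven"},  # wrong format
]


def normalize_missing(value):
    """Replace blank cells with the missing-value marker."""
    stripped = value.strip()
    return MISSING if stripped == "" else stripped


def is_valid_age(value):
    """Accept the missing marker or a whole number from 0 to 17 (assumed range)."""
    if value == MISSING:
        return True
    return value.isdecimal() and 0 <= int(value) <= 17


# Work on a cleaned copy so the original rows stay untouched.
cleaned = [{key: normalize_missing(val) for key, val in row.items()} for row in rows]
problems = [row["child_id"] for row in cleaned if not is_valid_age(row["age"])]

print(cleaned)
print("Needs review:", problems)
```

Spreadsheet programs offer built-in data validation that serves the same purpose at the moment of entry. A check like this one is useful afterwards, when reviewing data that has already been collected.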
5. Data Organization:
- Tidy Data Principles: Adhere to the foundations of tidy data.
- Avoid Storing Data as Images/Objects: Always input data in text or numeric format. Avoid embedding images or objects, as they cannot be analyzed.
- Separate Raw and Processed Data: Maintain a clear distinction between raw and processed data using separate sheets or tabs. Always preserve raw and original data before any modifications or analysis.
- Use or Create a Data Dictionary: A data dictionary describes what a dataset contains, defines each variable, and explains the values each variable can take.
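A data dictionary can itself be a small tidy table, with one row per variable. The sketch below shows one possible layout and a quick check that every column in a dataset is documented. The variable names, descriptions, and the four fields used here are illustrative assumptions, not a required format.

```python
# Hypothetical data dictionary: one entry per variable in the dataset.
data_dictionary = [
    {"variable": "family_id", "type": "text",
     "description": "Unique identifier for each family",
     "allowed_values": "F followed by a number"},
    {"variable": "income", "type": "number",
     "description": "Annual household income in US dollars",
     "allowed_values": "0 or greater; NA if missing"},
]

# Column names found in the (hypothetical) dataset's header row.
dataset_columns = ["family_id", "income", "caregiver"]

# Any column without a dictionary entry still needs to be documented.
documented = {entry["variable"] for entry in data_dictionary}
undocumented = [col for col in dataset_columns if col not in documented]

print("Undocumented columns:", undocumented)
```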