File organization and naming
are powerful weapons against chaos.

– Jenny Bryan

Organizing projects

  • all files in common directory
  • separate raw data from clean data
  • separate code from data
  • file names: meaningful, sortable, consistent
  • dates like 2016-08-24
        RawData/
        InProcessData/
        CleanData/
        Code/
        Reports/

Your closest collaborator is you six months ago,
but you don't reply to email.

– (paraphrasing) Mark Holder

Have sympathy for your future self.

Spreadsheets

  • make it a rectangle
  • rows = observations, columns = variables
  • one header row; avoid spaces
  • one thing per cell
  • fill in all cells
  • consistent missing value codes
  • care about dates (3 separate columns?)
  • no calculations in raw data files
  • save as CSV
  • don't use font color or highlighting as data

OpenRefine

  • Cleaning & exploration
  • faceting / filtering
  • splitting a column
  • remove trailing/ending text
  • cluster "levels" of a factor
  • identify outliers
  • all actions are reproducible

SQL

    SELECT * FROM species
    SELECT weight/1000.0 FROM species
        WHERE species_id = 'DM' AND weight > 78
        (OR IN)
    ORDER BY
        (ASC DESC)
    GROUP BY
    COUNT(*), SUM(weight)
    JOIN ON

R

  • full programming language
  • focused on programming with data
  • super for data analysis and visualization
  • R Archive ("CRAN") has >9000 add-on packages
  • RStudio: "Integrated development environment" (IDE) for R

dplyr

    select          SELECT *
    filter          WHERE
    mutate          weight/1000.0
    group_by        GROUP BY
    summarize       COUNT(*), AVG(weight)
    arrange         ORDER BY