Course readings


Textbook for general introduction to social data science:

Textbooks for data science in Python

Preparation and Assignment 0

The course begins with your preparation for the teaching. This consists of Assignment 0, which is both a learning module and an assignment. It provides an overview of basic Python and how to use it for data analysis. We also introduce Markdown, which allows simple formatting in plain text.

Required reading

Inspirational reading and other sources for learning

There are many good resources for learning how to master data structuring. See below for two ways of self-learning:

Session 1: Introduction to the course and Python

We introduce the course and provide an overview of logistics. We also introduce Git and GitHub, which we will be working with throughout the course.

Required readings

Introduction to Social Data Science

Git - a tool for storing and sharing code

Inspirational reading *

If you’re interested, and want to delve deeper into coding and programming (it's optional!), we highly recommend the following posts:

A broad, early, and easy-to-read account of data-driven (social) science:

Session 2: Data structuring 1

We learn about data transformation and working with specific data types in pandas: string, missing, categorical and temporal.
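As a minimal sketch of the four data types named above, assuming a small made-up DataFrame (the column names and values are purely illustrative and not course data):

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "name": [" Anna ", "bo", None],
    "grade": ["high", "low", "high"],
    "date": ["2019-08-01", "2019-08-15", "2019-09-01"],
})

# String data: vectorized string methods live under the .str accessor.
df["name_clean"] = df["name"].str.strip().str.title()

# Missing data: count missing values, then fill them with a placeholder.
n_missing = df["name"].isna().sum()
df["name_filled"] = df["name_clean"].fillna("Unknown")

# Categorical data: store repeated labels as a categorical dtype.
df["grade"] = df["grade"].astype("category")

# Temporal data: parse strings to datetimes and use the .dt accessor.
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month
```

Each type has its own accessor or method family (`.str`, `.cat`, `.dt`, and `isna`/`fillna`), which is why the session treats the four types separately.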

Required reading

  • PDA: chapter 7 and sections 5.3, 11.1-11.2, 12.1, 12.3.
  • PML: chapter 4, section 'Handling categorical data'.

Inspirational reading *

Session 3: Data structuring 2

We round off data structuring by learning two powerful tools: combining different data sets, and the split-apply-combine framework, which is called groupby in pandas.
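The two tools can be sketched as follows, assuming two made-up tables (the names `people` and `cities` and all values are hypothetical):

```python
import pandas as pd

# Hypothetical toy tables; names and values are illustrative only.
people = pd.DataFrame({
    "person": ["a", "b", "c", "d"],
    "city_id": [1, 1, 2, 2],
    "income": [100, 200, 300, 500],
})
cities = pd.DataFrame({"city_id": [1, 2], "city": ["Aarhus", "Odense"]})

# Combine: a left join keeps every row in `people` and adds the city name.
merged = people.merge(cities, on="city_id", how="left")

# Split-apply-combine: split rows by city, apply the mean to income,
# and combine the per-group results into a single Series.
mean_income = merged.groupby("city")["income"].mean()
```

Here `mean_income` is indexed by city, so `mean_income["Aarhus"]` gives the average income of the two people in that group.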

Required reading

  • PDA: chapters 8 and 10.

Inspirational reading *

Session 4: Intro to visualization

We introduce visualizations in Python. We use pandas and seaborn. Both these modules are built on the fundamental and flexible plotting module matplotlib.

Required reading

Inspirational reading *

Session 5: Strings, queries and APIs

We start to leverage our Python knowledge to make queries on the web. This allows us to pull data directly from Statistics Denmark's API.
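A web query is typically a base URL plus encoded parameters. The sketch below shows that string-building step using only the standard library. The base URL and parameter names are hypothetical placeholders, not Statistics Denmark's actual endpoint, and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical base URL and parameter names, used only for illustration.
BASE_URL = "https://api.example.com/v1/data"

def build_query_url(base_url, **params):
    """Return base_url with params encoded as a query string."""
    return f"{base_url}?{urlencode(params)}"

url = build_query_url(BASE_URL, table="population", year=2019, format="json")

# Parsing the URL back shows that the query round-trips into a dict of lists.
parsed = parse_qs(urlsplit(url).query)
```

In practice the resulting URL would be passed to an HTTP client, for example `requests.get(url)`, and the response decoded before loading it into pandas.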

Required reading

Session 6: Scraping 1 - Introduction to web scraping

We learn to create and collect datasets from the web. This means interacting with APIs and webpages, and extracting information from unstructured webpages.

Required readings

Inspirational reading *

Below are some interesting academic papers using data scraped from online sources that might provide inspiration for your exam project.

Session 7: Scraping 2 - Parsing

Here we develop our skills in parsing. This is a fundamental data science skill that goes beyond web scraping alone.
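As a small illustration of parsing, the sketch below extracts link targets from an HTML string. It assumes a made-up snippet and uses the standard library's `html.parser`, which may differ from the parsing library used in the session.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

# Hypothetical HTML snippet, illustrative only.
html = '<ul><li><a href="/page1">One</a></li><li><a href="/page2">Two</a></li></ul>'

parser = LinkParser()
parser.feed(html)
links = parser.links
```

The same pattern of walking a structured document and keeping only the fields of interest applies to other formats such as JSON or XML.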

Required readings

Session 8: Scraping 3 - Advanced Scrapers

We become good scrapers by learning to automate browsing and to use regular expressions (regex). To become great, we also need to study what website owners do to avoid being scraped. This includes bots, honey traps, AJAX, etc.
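A minimal regex sketch, assuming a made-up piece of scraped text. Both patterns are deliberately simple illustrations rather than robust validators.

```python
import re

# Hypothetical scraped text, illustrative only.
text = "Contact: anna@example.com, bo@example.org. Posted 2019-08-01 and 2019-08-15."

# ISO-style dates: four digits, two digits, two digits, separated by hyphens.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# A simple e-mail pattern: local part, "@", then dot-separated domain labels.
# Real addresses can be more varied than this pattern allows.
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
```

The non-capturing group `(?:...)` keeps `re.findall` returning whole matches, and requiring a word character after each dot keeps the sentence-ending period out of the second address.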

Inspirational reading *

Session 9: Ethics and Big Data Intro

TBD

Session 10: Modeling and machine learning

We introduce basic machine learning (ML) concepts. We start with simple machine learning models for classification problems.

Required readings

  • PML: chapters 1 and 2, and the following section from chapter 3:
    • Modeling class probabilities via logistic regression

Session 11: Regression and regularization

We explain the problem of overfitting in modelling. We show that one possible solution is regularization of standard linear models.
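To make the idea concrete, the sketch below fits a linear model with and without an L2 (ridge) penalty. It assumes simulated data and uses a closed-form NumPy solution without an intercept, rather than the scikit-learn estimators covered in PML. The helper `ridge_fit` is a hypothetical name introduced for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (X'X + alpha*I)^(-1) X'y, without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical simulated data: only the first of ten features matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=30)

w_ols = ridge_fit(X, y, alpha=0.0)     # alpha = 0 is ordinary least squares
w_ridge = ridge_fit(X, y, alpha=10.0)  # alpha > 0 shrinks the coefficients

ols_norm = np.linalg.norm(w_ols)
ridge_norm = np.linalg.norm(w_ridge)
```

The penalty shrinks the coefficient vector towards zero, so `ridge_norm` is smaller than `ols_norm`. This trades a little bias for lower variance, which is how regularization counteracts overfitting.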

Required readings

  • PML: chapter 3, the following sections:
    • Tackling overfitting via regularization
  • PML: chapter 4, the following sections:
    • Partitioning a dataset into separate training and test sets
    • Bringing features onto the same scale
    • Selecting meaningful features
  • PML: chapter 10, the following sections:
    • Introducing linear regression
    • Implementing an ordinary least squares linear regression model
    • Evaluating the performance of linear regression models
    • Using regularized methods for regression
    • Turning a linear regression model into a curve – polynomial regression

Session 12: Model selection and cross-validation

We introduce cross-validation to gauge overfitting and review the linear model.
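A minimal sketch of k-fold cross-validation for a linear model, written in plain NumPy so that each step is visible. The data are simulated, and the helper names `k_fold_indices` and `cross_val_mse` are hypothetical. The readings cover the corresponding scikit-learn tools.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the row indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_val_mse(X, y, k=5):
    """Average held-out mean squared error of least squares over k folds."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then evaluate on fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        scores.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return float(np.mean(scores))

# Hypothetical simulated data, illustrative only.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

cv_mse = cross_val_mse(X, y, k=5)
```

Every observation is used for testing exactly once, so the averaged error estimates out-of-sample performance without setting aside a single fixed test set.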

Required readings

  • PML: chapter 6, the following sections:
    • Streamlining workflows with pipelines
    • Using k-fold cross-validation to assess model performance
    • Debugging algorithms with learning and validation curves
    • Fine-tuning machine learning models via grid search

Session 13: Non-linear ML and applications

We give an overview of non-linear machine learning models and outline how machine learning tools can be applied in social science.

Required readings

Session 14: Text data

We introduce the concept of Text as Data, and apply our newly acquired knowledge of supervised learning to a text classification problem.
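As a sketch of how text becomes data for a classifier, the bag-of-words representation counts how often each vocabulary word occurs in each document. The example assumes a made-up three-document corpus and naive whitespace tokenization, and uses only the standard library.

```python
from collections import Counter

# Hypothetical mini-corpus, illustrative only.
docs = ["the movie was great", "the movie was bad", "great great acting"]

# Tokenize by lowercasing and splitting on whitespace (a deliberately naive choice).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all tokens across the corpus.
vocab = sorted({token for tokens in tokenized for token in tokens})

# Each document becomes a vector of token counts in vocabulary order.
bow = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow.append([counts[word] for word in vocab])
```

Each row of `bow` can then serve as a feature vector for a supervised model such as logistic regression.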

Required readings

  • PML: chapter 8, the following sections:
    • Preparing the IMDb movie review data for text processing
    • Introducing the bag-of-words model
    • Training a logistic regression model for document classification
  • Gentzkow, M., Kelly, B.T. and Taddy, M., 2019. "Text as data" Journal of Economic Literature 57(3).

  • Jurafsky, D., & Martin, J. H. (2019). Vector Semantics and Embeddings. Speech and Language Processing, 3rd ed. draft. https://web.stanford.edu/~jurafsky/slp3/6.pdf

Inspirational readings *

Gorrell, Genevieve et al. “Twits, Twats and Twaddle: Trends in Online Abuse towards UK Politicians.” ICWSM (2018). https://gate-socmedia.group.shef.ac.uk/wp-content/uploads/2019/07/Gorrell-Greenwood.pdf

Pang, Bo et al. “Thumbs up? Sentiment Classification using Machine Learning Techniques.” EMNLP (2002). https://www.aclweb.org/anthology/W02-1011.pdf

*: Note: There might be a paywall on some of the inspirational readings. Don't worry if you cannot get access: these readings are only inspirational.