March 20, 2022

Introducing Core Concepts

Thinking about all survey data as tensors

We deal exclusively with discrete variables. This means all variables can be represented as tensors. Tensors have a few properties:

rank: the number of dimensions
shape: a row vector giving the size along each dimension, so its length equals the rank

Let’s consider a multi-select question with 5 options. The data might look something like [1,0,1,1,0] for a single response, or [[1,0,1,1,0], [0,0,0,1,0], [1,1,1,1,1]] for a set of 3 responses.

The shape of this tensor is $(n_{responses}, n_{options})$, which is $(3, 5)$ for the 3 responses above. Its rank is 2: one more than the rank of the underlying data type, which for selection questions is a rank 1 row vector.
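To make the shape and rank concrete, here is a minimal sketch in plain Python. The helper `shape` is a hypothetical name introduced only for illustration, and it assumes the nested list is regular (every row has the same length).

```python
def shape(tensor):
    """Return the shape of a regular nested list as a tuple of sizes."""
    dims = []
    # Walk down the first element at each level, recording the length.
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(dims)


# Three responses to a multi-select question with 5 options.
responses = [
    [1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]

single_shape = shape(responses[0])  # one response: (n_options,)
all_shape = shape(responses)        # all responses: (n_responses, n_options)
rank = len(all_shape)               # rank is the length of the shape
```

Stacking the responses adds one leading dimension, which is why the rank of the full data is one more than the rank of a single response.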

Generalizing the underlying data format

Rank 0 question types: Number, Slider
Rank 1 question types: Selection, Dropdown
- $shape=(n_{options})$
Rank 2 question types: Likert
- $shape=(n_{options}, n_{prompts})$
Unstructured types: Text, audio, video, etc.
- Can be converted into vectors with text/speech analytics (feature extraction)
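The structured question types above can be sketched as a lookup from question type to the shape of a single response. The function names and the lowercase type labels are assumptions made for this example, not part of any existing library.

```python
def per_response_shape(question_type, n_options=None, n_prompts=None):
    """Shape of one respondent's answer for a structured question type."""
    if question_type in ("number", "slider"):
        return ()                        # rank 0: a single scalar
    if question_type in ("selection", "dropdown"):
        return (n_options,)              # rank 1: one entry per option
    if question_type == "likert":
        return (n_options, n_prompts)    # rank 2: options by prompts
    raise ValueError(f"no fixed shape for question type: {question_type}")


def stacked_shape(n_responses, response_shape):
    """Shape after stacking all responses along a new leading dimension."""
    return (n_responses,) + response_shape


slider_all = stacked_shape(3, per_response_shape("slider"))
selection_all = stacked_shape(3, per_response_shape("selection", n_options=5))
likert_all = stacked_shape(
    3, per_response_shape("likert", n_options=5, n_prompts=4)
)
```

Unstructured types such as text or audio raise an error here because they have no fixed shape until feature extraction turns them into vectors.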

Formalizing Survey Variables

Consider a survey $S$ with 2 questions:

What is your gender? (M, F, Other)
What is your favourite letter? (A, B, C)

We can call the full sets of answers to these questions $X_1$ and $X_2$, our first 2 variables. Say $N$ respondents answer this survey:

$$ R_1, R_2, ... , R_N $$

We know that $X_1= \{X_1[R_a], X_1[R_b], X_1[R_c], ...\}$ is a set potentially as large as $N$. Writing $R$ for the set of respondents, we definitely know $|X_n| \leq |R| = N$ for any variable $X_n$.
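As a rough illustration of a survey variable, the sketch below models $X_1$ as a mapping from respondent to answer. The respondent labels, the answers, and the assumption that a respondent may skip a question are all made up for this example.

```python
respondents = ["R1", "R2", "R3", "R4"]  # N = 4 (illustrative labels)
gender_options = ["M", "F", "Other"]

# X1 holds one answer per respondent who answered; here R3 skipped it.
X1 = {"R1": "F", "R2": "M", "R4": "Other"}


def one_hot(answer, options):
    """Encode a single-choice answer as a rank 1 vector of 0s and 1s."""
    return [1 if option == answer else 0 for option in options]


# Stack the encoded answers into a rank 2 tensor, keeping respondent order.
X1_tensor = [one_hot(X1[r], gender_options) for r in respondents if r in X1]

n_answers = len(X1)        # |X_1|
n_total = len(respondents)  # |R| = N
```

Because each respondent contributes at most one entry, `n_answers` can never exceed `n_total`, which matches the bound $|X_n| \leq |R|$.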