In this lecture we will consider how to generate simulated data from a model, using random number generation on a computer. We will learn about the concept of a generative model, how we can draw random numbers from different distributions, and how we can use these to create simulated data that conform to different statistical models.

Why do this?

There are many reasons we might want to simulate data from a model.

In this lecture and practical we will mostly use simulation to gain a deeper understanding of statistical sampling theory. Those continuing on to do projects in semester 2 may discover other uses for simulation in those projects.

Before we go into further details, perhaps an example will best demonstrate why we might need to generate our own data.

How do you know linear regression works?

In MATH1700 you will have learned how to carry out linear regression analyses using the principle of Least Squares. Put very simply, the idea of linear regression is to recover a real (linear) relationship between some predictor variable \(x\) and a response variable \(y\), which has been corrupted by noise. That is, we assume that: \[y = a + bx + \epsilon\] where \(\epsilon\) is the residual.

You will have learned how to estimate \(a\) and \(b\) such that the sum of squared residuals is minimised. It can of course be proved analytically that these estimates are unbiased, but it can be more illuminating to see directly that the method works by using simulated data. Let's create a data set with known \(a=1\) and \(b=2\), and then fit a linear regression model using the R function lm(). First, the data creation:

# True parameter values for the simulation
a = 1
b = 2
# Predictor values
x = 1:10
# One normally distributed residual per data point
epsilon = rnorm(n=10, mean=0, sd=1)
y = a + b*x + epsilon
plot(x, y)

Notice that we had to create \(n=10\) random residuals: one for each data point. These were generated from a normal distribution - more on this later.

Now we fit a linear regression and print a model summary that contains the model parameter estimates.

m = lm(y ~ x)
print(summary(m))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1348 -0.5624 -0.1393  0.3854  1.6814 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.5255     0.6673   2.286   0.0516 .  
## x             1.9180     0.1075  17.835    1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9768 on 8 degrees of freedom
## Multiple R-squared:  0.9755, Adjusted R-squared:  0.9724 
## F-statistic: 318.1 on 1 and 8 DF,  p-value: 1e-07

We can see that the estimate for \(b\) (the coefficient of x) in particular is quite close to its true value. The intercept is a little further from its expected value of 1. You can try repeating this exercise many times to see how the coefficients change each time you generate a new data set, and whether they are close to the true values on average; a sketch of this is shown below. Notice as well that the generated data are quite 'clean', with only a slight deviation from the straight line. What do you think would happen with noisier data sets? What about larger or smaller data sets? How would you investigate this?
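
As a minimal sketch of that kind of investigation (the number of repetitions and the variable names here are illustrative choices, not fixed requirements), we can regenerate the data many times and record the slope estimate each time:

# a, b and x are as defined above
b_estimates = rep(NA, 100)
for (i in 1:100){
  # Generate a fresh data set from the same model
  epsilon = rnorm(n=10, mean=0, sd=1)
  y = a + b*x + epsilon
  # Record the estimated slope coefficient
  b_estimates[i] = coef(lm(y ~ x))['x']
}
hist(b_estimates, breaks = 10, xlab = 'Estimated b', ylab = 'Frequency')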

One of the assignment tasks also asks you to explore this more systematically.

Generative models

You may have encountered the terms ‘generative model’ or ‘generative AI’ in the context of Large Language Models (LLMs) like ChatGPT or Microsoft Co-Pilot. The word ‘generative’ indicates that these models are not primarily designed to describe existing data but to generate new outputs. In this lecture we won’t be learning specifically about LLMs, but we will be using generative models to generate new outputs in the form of simulated data.

Usually when we specify a statistical model, it looks something like this description of linear regression: \[Y = a + bX + \epsilon\] Here we are stating that the values of the variable \(Y\) (often called the dependent variable or response variable) are given by an intercept value \(a\), a gradient \(b\) multiplied by another variable \(X\) (the independent variable or explanatory variable), plus a residual \(\epsilon\). The residual is often seen as representing the error in the regression – the gap between what the model predicts and reality.

Another way to look at this model is as a specification of how \(Y\) is generated. In this case, if we know \(X\), \(Y\) is generated by taking the value of \(a + bX\) and then adding a residual \(\epsilon\).

What should the value of the residual be? To determine this we must specify a distribution from which the residuals are drawn. A common choice in linear regression is to assume that the residuals are normally distributed with zero mean and some standard deviation \(\sigma\): \[\epsilon \sim N(0, \sigma)\] So, the instructions for generating \(Y\) can be written as follows (a code sketch follows the list):

  1. Multiply the value of \(X\) by \(b\)
  2. Add \(a\)
  3. Draw a normally distributed residual \(\epsilon\) from a normal distribution with zero mean and standard deviation \(\sigma\).
  4. Add \(\epsilon\) to \(a + bX\) to get \(Y\).
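
In R, this recipe might look like the following sketch, where the parameter values and the single value of \(X\) are arbitrary illustrations:

a = 1; b = 2; sigma = 0.5                # illustrative parameter values
X = 3                                    # a known value of X
epsilon = rnorm(n=1, mean=0, sd=sigma)   # step 3: draw the residual
Y = a + b*X + epsilon                    # steps 1, 2 and 4: a + bX plus the residual
print(Y)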

Random and pseudo-random numbers

Before we get to the details of how to generate random numbers, perhaps we should stop to consider what we mean by randomness. ‘Random’ is a word we encounter a lot, but it is not always clear precisely what it means. Is the result of a coin toss, or the roll of a die, ‘random’? We often treat them as such, but each is determined by the laws of physics. Perhaps if we knew enough about how a coin is tossed then we could predict which side it will land on with certainty.

There are some physical processes that (as far as we know) are truly random. If you can predict exactly when the nucleus of an atom will decay, then write up your method and prepare to win a Nobel Prize. In the UK the government offers a savings product called Premium Bonds. Each month a number of these bonds are selected to win prizes, and this selection is done by a machine called ERNIE (Electronic Random Number Indicator Equipment), which utilises quantum technology to generate random numbers.

For our purposes, however, this is rather too complex and takes too long. What we want are numbers that are unpredictable, and that follow a specific distribution. For this, we use something called a pseudo-random number generator. This is a deterministic algorithm that generates a sequence of numbers in a very complex, unpredictable pattern. There are many different algorithms available, but the one built into R (and popular elsewhere) is called the Mersenne-Twister. We won’t be looking at the algorithm in detail, but you can read more about it on Wikipedia.
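
You can check which algorithm your copy of R is using with the built-in RNGkind() function; on a recent default installation the first entry should be the Mersenne-Twister:

print(RNGkind())
## e.g. "Mersenne-Twister" "Inversion" "Rejection"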

The sequence of numbers coming from a pseudo-random number generator (RNG) depends on how it is initialised. This is called setting the seed of the RNG. If you do nothing, most programming languages will set this for you based on the system clock, but it is often a good idea to set it manually. That way, if you re-run your code you will get the same results. Let's set it here in R:

set.seed(123)
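
To see the effect, note that re-setting the seed reproduces exactly the same sequence of random numbers (a quick check; the seed value 123 is arbitrary):

set.seed(123)
print(rnorm(3))
set.seed(123)
print(rnorm(3))  # identical to the three numbers above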

Random number generators in R

Programming languages such as R, Python, C and Java all include random number generators. The most fundamental typically generate random integers between 0 and some maximum number \(N\). These can then be transformed to give numbers obeying other statistical distributions.
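
One standard transformation is the inverse-CDF method. As a small illustrative sketch (not necessarily how R implements every distribution internally): if \(U\) is uniform between 0 and 1, then \(-\log(U)/\lambda\) follows an exponential distribution with rate \(\lambda\):

U = runif(n=5)       # uniform numbers between 0 and 1
X = -log(U) / 2      # inverse-CDF transform: X ~ Exponential(rate = 2)
print(X)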

As a statistics-oriented language, R includes functions that can produce random numbers generated from a wide variety of probability distributions. These all follow a similar syntax, specifying the number of random numbers to generate, and the parameters of the distribution. For instance, to generate \(n=1000\) random numbers from a normal distribution with mean 0 and s.d. 1, we use:

X = rnorm(n=1000, mean = 0, sd = 1)

We can check we have the right distribution by plotting a histogram - the ‘breaks’ parameter suggests (approximately) how many bins to use:

hist(X, breaks = 10, xlab= 'X', ylab='Frequency')
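
As an optional extra check (not part of the original demonstration), we can plot the histogram on the density scale and overlay the theoretical normal density curve:

hist(X, breaks = 10, freq = FALSE, xlab = 'X', ylab = 'Density')
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, col = 'red', lwd = 2)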

To see another example, we can follow the same sort of process to generate \(n=1000\) random numbers from an exponential distribution with mean 1 and graph these. Note that the exponential distribution only has one parameter, the rate, where rate = 1/mean.

X = rexp(n=1000, rate = 1)
hist(x=X, breaks = 10, xlab= 'X', ylab='Frequency')
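
As a quick sanity check, the sample mean of these draws should be close to the theoretical mean of 1:

# The mean of an exponential with rate 1 is 1
print(mean(X))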

Below is a list of common distributions and their R functions:

  Normal: rnorm(n, mean, sd)
  Uniform: runif(n, min, max)
  Exponential: rexp(n, rate)
  Binomial: rbinom(n, size, prob)
  Poisson: rpois(n, lambda)

Others can be straightforwardly found via Google: try searching for ‘gamma distribution R’ for example. Some more obscure distributions sometimes require you to install a package before using them, e.g. the multivariate normal distribution is in package ‘mvtnorm’.
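
For instance, once the ‘mvtnorm’ package is installed (with install.packages('mvtnorm')), a sketch of drawing correlated normal pairs might look like this (the correlation value 0.5 is an arbitrary illustration):

library(mvtnorm)
# Covariance matrix: unit variances, correlation 0.5
Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2)
X = rmvnorm(n = 100, mean = c(0, 0), sigma = Sigma)
plot(X[,1], X[,2], xlab = 'X1', ylab = 'X2')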

Simulating data from a model

Random numbers allow us to simulate data from a statistical model. This can be very useful to check whether a model produces simulated data that looks anything like a real data set, and to assess the performance of a statistical analysis when analysing data sets of different sizes and types. This type of process is called a Monte Carlo method, named by the mathematician Stanislaw Ulam after the famous Monte Carlo casino in Monaco.

Monte Carlo Casino and Stanislaw Ulam

Let's look at an example. We can simulate a data set of heights for \(n\) men and \(m\) women. Let's imagine that men have an average height of 176cm with a s.d. of 7cm, and women have an average height of 162cm with a s.d. of 6cm (these numbers are broadly accurate for the UK). Let's set \(n=10\) and \(m=10\). First we create a vector of indicator variables for the sex of \(n\) males (‘M’) and \(m\) females (‘F’):

# Sample sizes for males and females
n = 10
m = 10
# Indicator vector: first n entries are 'M', then m entries are 'F'
Sex = c(rep('M', n), rep('F', m))

Then we simulate a height for each individual by drawing a random number from the right distribution. First we create a blank vector full of ‘NA’s, then we fill it with the required numbers:

Height = rep(NA, n+m)
Height[Sex=='M'] = rnorm(n, mean = 176, sd=7)
Height[Sex=='F'] = rnorm(m, mean = 162, sd=6)

Now we can imagine having received these data, and being asked to estimate the average difference in height between men and women. The best estimate for this difference is simply the difference between the two sample means:

D = mean(Height[Sex=='M']) - mean(Height[Sex=='F'])
print(D)
## [1] 11.11343

Recall that the true value of the difference is 14cm. Clearly the value of our estimate will depend on the data set we happened to generate. To get an idea of how precise this estimate will be, we can repeat the whole process a large number of times (say 1000), recording the estimate each time:

D = rep(NA, 1000)
for (i in 1:1000){
  Height = rep(NA, n+m)
  Height[Sex=='M'] = rnorm(n, mean = 176, sd=7)
  Height[Sex=='F'] = rnorm(m, mean = 162, sd=6)
  D[i] = mean(Height[Sex=='M']) - mean(Height[Sex=='F'])
}
hist(D, breaks = 30, xlab = 'Estimated difference', ylab = 'Frequency', main = "", xlim = c(0, 30))
abline(v=14, col='red', lwd=3)

The code above plots a thick red vertical line at the true value of the difference, helping us to see the performance of the estimator. We have also set the limits of the x-axis explicitly, so we can make other plots later at the same scale. For comparison, consider how the same estimator performs if we simulate 10 times more data each time: \(n=m=100\).

n = 100
m = 100
Sex = c(rep('M', n), rep('F', m))
D = rep(NA, 1000)
for (i in 1:1000){
  Height = rep(NA, n+m)
  Height[Sex=='M'] = rnorm(n, mean = 176, sd=7)
  Height[Sex=='F'] = rnorm(m, mean = 162, sd=6)
  D[i] = mean(Height[Sex=='M']) - mean(Height[Sex=='F'])
}
hist(D, breaks = 30, xlab = 'Estimated difference', ylab = 'Frequency', main = "", xlim = c(0, 30))
abline(v=14, col='red', lwd=3)

By plotting the histogram on the same scale we can see how much more closely these estimates cluster around the true answer than before.
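
One simple way to quantify this clustering (an optional check, not part of the demonstration above) is to compute the standard deviation of the 1000 estimates and compare it with the theoretical standard deviation of a difference of two sample means, \(\sqrt{7^2/n + 6^2/m}\):

# Empirical spread of the estimates at n = m = 100
print(sd(D))
# Theoretical standard deviation of the difference of sample means
print(sqrt(7^2/100 + 6^2/100))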

Sampling from a set

In statistics we are usually trying to infer some property of a large population from a smaller sample. For example, we might seek as above to estimate the difference in height between males and females across the whole population of the UK, using a sample of 10 individuals of each sex. For mathematical convenience we usually consider the size of the population to be effectively infinite, but of course this isn't actually true. What if we wanted to explore what small samples could tell us about a finite and relatively small population, such as all the students in the School of Mathematics (approximately 1000 students, split roughly equally male to female)?

First let's generate our ‘population’ using the techniques above. We'll generate 500 male heights and 500 female heights and find the actual average height difference in that population:

male_heights_population = rnorm(n=500, mean = 176, sd=7)
female_heights_population = rnorm(n=500, mean = 162, sd=6)
D_population = mean(male_heights_population) - mean(female_heights_population)
print(D_population)
## [1] 14.06398

Now imagine taking a random sample of \(n=10\) males and \(m=10\) females. We can take a random sample using the sample() function in R. This has the syntax: sample(x, size, replace = FALSE), where ‘x’ is the vector of numbers to sample from, ‘size’ is how many numbers to sample, and ‘replace’ tells the sampler whether to sample with replacement or not (the default is not). So to take a single sample of 10 males and 10 females without replacement and use this to estimate the population height difference we would use:

male_heights_sample = sample(x=male_heights_population, size=10, replace=FALSE)
female_heights_sample = sample(x=female_heights_population, size=10, replace=FALSE)
D_sample = mean(male_heights_sample) - mean(female_heights_sample)
print(D_sample)
## [1] 16.36472

As before, we can assess how well this estimator performs by repeating the sampling process many times. This time the true answer is given by the population mean difference.

D_sample = rep(NA, 1000)
for (i in 1:1000){
  male_heights_sample = sample(x=male_heights_population, size=10, replace=FALSE)
  female_heights_sample = sample(x=female_heights_population, size=10, replace=FALSE)
  D_sample[i] = mean(male_heights_sample) - mean(female_heights_sample)
}
hist(D_sample, breaks = 30, xlab = 'Estimated difference', ylab = 'Frequency', main = "", xlim = c(0, 30))
abline(v=D_population, col='red', lwd=3)

This is useful because it more closely mimics the real process of sampling from a population than if we considered the population to be infinite. It also helps us to understand those often-convoluted statistical descriptions of confidence intervals, or of hypothesis tests. Recall that if we test a null hypothesis \(H_0\) that the true mean of some population is zero, and our sample mean is \(X\), we want to ask: ‘if we took a very large number of samples of the same size from a population with a true mean of zero, how often would the sample mean be greater than or equal to \(X\)?’. We can answer this by actually taking a large number of samples using the methods above and looking at the sample means.
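
As a minimal sketch of that idea (the sample size, standard deviation and observed value here are all arbitrary illustrations):

# Simulate 1000 sample means from a population with true mean zero
sample_means = rep(NA, 1000)
for (i in 1:1000){
  sample_means[i] = mean(rnorm(n=10, mean=0, sd=1))
}
# Estimate how often a sample mean is at least as large as an observed value
X_observed = 0.5
print(mean(sample_means >= X_observed))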

Assignment 2: Your Task - Due 11/11/2025 at 23:59

You should complete the following tasks and present your findings in a 2-page report. Your report should present your results using a mixture of text and visualisations (e.g. graphs), ideally with one figure for each task, as well as a short introduction and conclusion. Use the LaTeX template provided, as for the first assignment.

Your report should be written to illustrate the properties of statistical estimators to an incoming first year undergraduate student or a final year A-level student. Do not include large amounts of R code, but focus instead on the most important details of your investigation and what your findings show. Where possible, relate this to your existing statistics knowledge, but do not include detailed mathematical derivations – it is sufficient to state known results.

  1. Repeat the process shown directly above for \(n=m=k\) where \(k = 1, 5, 10, 25, 50\). Visualise the performance of the estimator as a function of \(k\). Confirm that the average squared error in D_sample scales as \(1/k\). What do you notice if \(k = 500\)?

  2. Consider a linear model: \(y = 1 + x + \epsilon\), where \(\epsilon \sim N(0, 1)\). Given values of \(x = 0, 1, 2, \ldots, k\), simulate corresponding values of \(y_1, \ldots, y_k\), with \(k=10\). Fit a linear model using the lm() command in R and record the estimate of the slope. Repeat this 1000 times and visualise how the estimates cluster around the true value, as in the demonstration above. What happens if you try using different standard deviations \(\sigma\) for the residuals, \(\epsilon \sim N(0, \sigma)\)?

  3. Simulate \(n=10\) points in a unit square, by drawing \(x\) and \(y\) coordinates independently from uniform distributions between 0 and 1. Count how many of these points obey \(x^2 + y^2 < 1\). How can you use this to estimate the value of \(\pi\)? Repeat your analysis for \(n=50, 100, 500, 1000\), performing the test multiple times for each value of \(n\) and visualise how the error in your estimate of \(\pi\) changes. Comment on the pattern you see.