Complete the Frankfort-Nachmias and Leon-Guerrero (2018) SPSS® problems and chapter exercises listed below.
Include your answers in a Microsoft® Word document.
Click the Assignment Files tab to upload your assignment.
Please see Chapter 6 material.
Until now, we have ignored the question of who or what should be observed when we collect data or whether the conclusions based on our observations can be generalized to a larger group of observations. In truth, we are rarely able to study or observe everyone or everything we are interested in. Although we have learned about various methods to analyze observations, remember that these observations represent a fraction of all the possible observations we might have chosen. Consider the following research examples.
Example 1: The Muslim Student Association on your campus is interested in conducting a study of experiences with campus diversity. The association has enough funds to survey 300 students from the more than 20,000 enrolled students at your school.
Example 2: Environmental activists would like to assess recycling practices in 2-year and 4-year colleges and universities. There are more than 4,700 colleges and universities nationwide.
Example 3: The Academic Advising Office is trying to determine how to better address the needs of more than 15,000 commuter students, but determines that it has only enough time and money to survey 500 students.
The primary problem in each situation is that there is too much information and not enough resources to collect and analyze it.
Researchers in the social sciences rarely have enough time or money to collect information about the entire group that interests them. Known as the population, this group includes all the cases (individuals, groups, or objects) in which the researcher is interested. For example, in our first illustration, there are more than 20,000 students; the population in the second illustration consists of 4,700 colleges and universities; and in the third illustration, the population is 15,000 commuter students.
Population A group that includes all the cases (individuals, objects, or groups) in which the researcher is interested.
Fortunately, we can learn a lot about a population if we carefully select a subset of it. This subset is called a sample. Through the process of sampling—selecting a subset of observations from the population—we attempt to generalize the characteristics of the larger group (population) based on what we learn from the smaller group (the sample). This is the basis of inferential statistics—making predictions or inferences about a population from observations based on a sample. Thus, it is important how we select our sample.
The term parameter, associated with the population, refers to measures used to describe the population we are interested in. For instance, the average commuting time for the 15,000 commuter students on your campus is a population parameter because it refers to a population characteristic. In previous chapters, we have learned the many ways of describing a distribution, such as a proportion, a mean, or a standard deviation. When used to describe the population distribution, these measures are referred to as parameters. Thus, a population mean, a population proportion, and a population standard deviation are all parameters.
We use the term statistic when referring to a corresponding characteristic calculated for the sample. For example, the average commuting time for a sample of commuter students is a sample statistic. Similarly, a sample mean, a sample proportion, and a sample standard deviation are all statistics.
Sample A subset of cases selected from a population.
Sampling The process of identifying and selecting a subset of the population for study.
Parameter A measure (e.g., mean, standard deviation) used to describe the population distribution.
Statistic A measure (e.g., mean, standard deviation) used to describe the sample distribution.
In this chapter and in the remaining chapters of this text, we discuss some of the principles involved in generalizing results from samples to the population. We will use different notations when referring to sample statistics and population parameters in our discussion. Table 6.1 presents the sample notation and the corresponding population notation.
Table 6.1 Sample and Population Notations
The distinctions between a sample and a population and between a parameter and a statistic are illustrated in Figure 6.1. We’ve included for illustration the population parameter of 0.60—the proportion of white respondents in the population. However, since we almost never have enough resources to collect information about the population, it is rare that we know the value of a parameter. The goal of most research is to find the population parameter. Researchers usually select a sample from the population to obtain an estimate of the population parameter. Thus, the major objective of sampling theory and statistical inference is to provide estimates of unknown parameters from sample statistics that can be easily obtained and calculated.
Learning Check 6.1
Take a moment to review the definitions of population, sample, parameter, and statistic. Use your own words so that the concepts make sense to you. Also review the sample and population notations. These concepts and notations will be used throughout the rest of the text in our discussion of inferential statistics.
We all use the concept of probability in everyday conversation. You might ask, “What is the probability that it will rain tomorrow?” or “What is the likelihood that I will do well on a test?” In everyday conversations, our answers to these questions are rarely systematic, but in the study of statistics, probability has a far more precise meaning.
In the following sections, we will discuss a variety of techniques adopted by social scientists to select samples from populations. The techniques follow a general approach called probability sampling. Before we discuss these techniques, we will briefly review some theories and principles of probability.
Probability A quantitative measure that a particular event will occur.
A probability is a quantitative measure that a particular event will occur. It is expressed as a ratio of the number of times an event will occur relative to the set of all possible and equally likely outcomes. Probability is represented by a lower case p.
p = Number of times an event will occur/Total number of events
Probabilities range in value from 0 (the event will not occur) to 1 (the event will certainly occur). As first discussed in A Closer Look 5.1, probabilities can be expressed as proportions or percentages. Consider, for example, the outcome of rolling a 3 with a six-sided, equally weighted die. The probability of rolling a 3 is 1/6 or 0.17, because this outcome can occur only once out of a total of six possible outcomes: 1, 2, 3, 4, 5, or 6.
Table 6.2 How Often Respondent Uses Media to Get Political News or Information, GSS 2014
Several times a day
Once a day
5–6 days a week
3–4 days a week
1–2 days a week
Less than 1 day a week
Over many rolls of the die, the chance of rolling a 3 is .17. So for every 100 rolls, we would expect a 3 to come up about 17 times. The other 83 or so times, we would see the other values of the die. We can also convert the proportion to a percentage (.17 × 100) and conclude that the probability of rolling a 3 is 17%.
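The die example can be checked with a few lines of Python. The sketch below is only an illustration: the function name `probability` and the variable names are our own, and it assumes that all outcomes are equally likely, as they are for a fair die.

```python
from fractions import Fraction


def probability(event_outcomes, all_outcomes):
    """Classical probability: favorable outcomes over all equally likely outcomes."""
    favorable = [outcome for outcome in all_outcomes if outcome in event_outcomes]
    return Fraction(len(favorable), len(all_outcomes))


die_faces = [1, 2, 3, 4, 5, 6]
p_three = probability({3}, die_faces)

print(p_three)                       # the exact ratio
print(round(float(p_three), 2))      # the same value as a proportion
print(round(float(p_three) * 100))   # the same value as a percentage
```

Working with an exact fraction first and rounding only at the end keeps the ratio, the proportion, and the percentage visibly tied to the same probability.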
Sometimes we use information from past events to help us predict the likelihood of future events. Such a method is called the relative frequency method. Let’s consider, for example, a sample of 367 respondents from the 2014 General Social Survey. Respondents were asked how often they use media (including television, newspapers, radio, and the Internet) to get their political news or information; their responses are summarized in Table 6.2.
The ratio of respondents who use the media once a day is 110:367, or when reduced, approximately 1:3. To convert a ratio to a proportion, we divide the numerator (110) by the denominator (367), as shown in Table 6.2. Thus, the probability 110/367 is equivalent to .30. Now, imagine that we wrote down each of the 367 respondents’ names and placed them in a hat. Because the proportion of .3 is closer to 0 than it is to 1, there is a low likelihood that we would select a respondent who uses the media once a day to get political news.
Learning Check 6.2
What is the probability of drawing an ace out of a normal deck of 52 playing cards? It’s not 1/52. There are four aces, so the probability is 4/52 or 1/13. The proportion is .08. The probability of drawing a specific ace, such as the ace of spades, is 1/52 or .02.
The observed relative frequencies are just an approximation of the true probability of identifying how often a respondent uses the media for political news. The true probabilities can only be determined if we were to repeat the study many times under the same conditions. Then, our long-run relative frequency (or probability) will approximate the true probability.
In Chapter 5 (“The Normal Distribution”), we converted the areas under the normal distribution into proportions or percentages of the number of observations in a sample based on standard deviation units from the mean. These proportions make it possible to estimate the probability of occurrence of these observations. For example, a study of 200 teen girls on the prevalence of texting found the average number of messages a teen girl texts per day to be 70 with a standard deviation of 10. We can estimate that the probability of randomly selecting a teen girl who texts between 70 and 80 messages per day is approximately .3413 (based on the definition of the normal distribution discussed in Chapter 5). We can also say that there is a 34.13% chance that any teen girl drawn randomly from the sample of 200 girls would text between 70 and 80 messages per day.
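The texting probability can be reproduced without a printed table of the normal distribution. This minimal sketch uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later). It assumes, as the example does, that daily text counts are normally distributed with a mean of 70 and a standard deviation of 10. The variable names are our own.

```python
from statistics import NormalDist

# Assumed model from the example: mean of 70 texts per day, standard deviation of 10.
texts_per_day = NormalDist(mu=70, sigma=10)

# Area under the curve between 70 and 80, that is, between Z = 0 and Z = 1.
p_70_to_80 = texts_per_day.cdf(80) - texts_per_day.cdf(70)

print(round(p_70_to_80, 4))
```

The result matches the proportion of .3413 for the area between the mean and one standard deviation above it.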
Social researchers are systematic in their efforts to obtain samples that are representative of the population. Such researchers have adopted a number of approaches for selecting samples from populations. Only one general approach, probability sampling, allows the researcher to use the principles of statistical inference to generalize from the sample to the population.
Probability sampling is a method that enables the researcher to specify for each case in the population the probability of its inclusion in the sample. The purpose of probability sampling is to select a sample that is as representative as possible of the population. The sample is selected in such a way as to allow the use of the principles of probability to evaluate the generalizations made from the sample to the population. A probability sample design enables the researcher to estimate the extent to which the findings based on one sample are likely to differ from what would be found by studying the entire population.
Probability sampling A method of sampling that enables the researcher to specify for each case in the population the probability of its inclusion in the sample.
Although accurate estimates of sampling error can be made only from probability samples, social scientists often use nonprobability samples because they are more convenient and cheaper to collect. Nonprobability samples are useful under many circumstances for a variety of research purposes. Their main limitation is that they do not allow the use of the method of inferential statistics to generalize from the sample to the population. Because through the rest of this text we deal only with inferential statistics, we will not review nonprobability sampling. In the following sections, we will learn about three sampling designs that follow the principles of probability sampling: (1) the simple random sample, (2) the systematic random sample, and (3) the stratified random sample.
The simple random sample is the most basic probability sampling design, and it is incorporated into even more elaborate probability sampling designs. A simple random sample is a sample design chosen in such a way as to ensure that (a) every member of the population has an equal chance of being chosen and (b) every combination of N members has an equal chance of being chosen.
Simple random sample A sample designed in such a way as to ensure that (a) every member of the population has an equal chance of being chosen and (b) every combination of N members has an equal chance of being chosen.
Let’s take a very simple example to illustrate. Suppose we are conducting a cost-containment study of the 10 hospitals in our region, and we want to draw a sample of 2 hospitals to study intensively. We can put into a hat 10 slips of paper, each representing 1 of the 10 hospitals, and mix the slips carefully. We select one slip out of the hat and identify the hospital it represents. We then make the second draw and select another slip out of the hat and identify it. The two hospitals we identified on the two draws become the two members of our sample. Assuming that we made sure the slips were really well mixed, pure chance determined which hospital was selected on each draw. The sample is a simple random sample because (1) every hospital had the same chance of being selected as a member of our sample of two and (2) every combination of two hospitals (N = 2) was equally likely to be chosen.
Researchers usually use computer programs or tables of random numbers in selecting random samples. An abridged table of random numbers is reproduced in Appendix A. To use a random number table, list each member of the population and assign the member a number. Begin anywhere on the table and read each digit that appears in the table in order—up, down, or sideways; the direction does not matter, as long as it follows a consistent path. Whenever we come across a digit in the table of random digits that corresponds to the number of a member in the population of interest, that member is selected for the sample. Continue this process until the desired sample size is reached.
Suppose now that, in your job as a hospital administrator, you are planning to conduct a cost-containment study by examining patients’ records. Out of a total of 300 patients’ records, you want to draw a simple random sample of five. You follow these steps:
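When a computer program is used instead of a random number table, the draw can be sketched as follows. This is only an illustration under stated assumptions: the 300 records are assumed to be numbered 1 through 300, and the function name `simple_random_sample` and the seed value are hypothetical choices made so that the draw can be repeated.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw record numbers without replacement; every combination is equally likely."""
    rng = random.Random(seed)                    # a seed makes the draw repeatable
    record_ids = range(1, population_size + 1)   # records numbered 1..population_size
    return sorted(rng.sample(record_ids, sample_size))


# Five of the 300 patient records; the seed is an arbitrary illustrative value.
selected_records = simple_random_sample(300, 5, seed=2018)
print(selected_records)
```

Because `random.sample` draws without replacement, no record can be selected twice, which mirrors skipping a repeated number in a random number table.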
Now let’s look at a sampling method that is easier to implement than a simple random sample. The systematic random sample, although not a true probability sample, provides results very similar to those obtained with a simple random sample. It uses a ratio, K, obtained by dividing the population size by the desired sample size:
K = Population size/Desired sample size
Systematic random sampling is a method of sampling in which every Kth member in the total population is chosen for inclusion in the sample after the first member of the sample is selected at random from among the first K members in the population.
Recall our opening example in which we had a population of 15,000 commuting students and our sample was limited to 500. In this example, K = 15,000/500 = 30.
Using a systematic random sampling method, we first choose any one student at random from among the first 30 students on the list of commuting students. Then, we select every 30th student after that until we reach 500, our desired sample size. Suppose that our first student selected at random happens to be the eighth student on the list. The second student in our sample is then 38th on the list (8 + 30 = 38). The third would be 38 + 30 = 68, the fourth, 68 + 30 = 98, and so on. An example of a systematic random sample is illustrated in Figure 6.2.
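The selection rule for the commuter-student example can be sketched in Python. The function name `systematic_sample` is our own, and the sketch assumes that the population size divides evenly by the sample size, as 15,000 does by 500. Other cases would need a rule for handling the remainder.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every Kth list position after a random start among the first K."""
    k = population_size // sample_size            # the sampling ratio K
    if start is None:
        start = random.Random(seed).randint(1, k) # random start among the first K
    if not 1 <= start <= k:
        raise ValueError("start must fall among the first K members")
    return [start + i * k for i in range(sample_size)]


# Reproduce the example: K = 30 and the random start happens to be the 8th student.
positions = systematic_sample(15_000, 500, start=8)
print(positions[:4])    # the first four list positions selected
print(len(positions))   # the sample size reached
```

Passing `start=8` fixes the random start so that the output can be compared with the positions worked out above. Leaving `start` unset lets the function choose it at random.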
Systematic random sampling A method of sampling in which every Kth member (K is a ratio obtained by dividing the population size by the desired sample size) in the total population is chosen for inclusion in the sample after the first member of the sample is selected at random from among the first K members in the population.
Learning Check 6.3
How does a systematic random sample differ from a simple random sample?
A third type of probability sampling is the stratified random sample. Suppose we want to examine the frequency of media access for political news by race and ethnicity. Our population of interest consists of 1,000 individuals, with 700 (or 70%) whites, 200 (or 20%) blacks, and 100 (or 10%) Latinos. We obtain a stratified random sample by (a) dividing the population into subgroups based on one or more variables central to our analysis and then (b) drawing a simple random sample from each of the subgroups. The choice of subgroups is based on what variables are known and what variables are of interest to us.
Stratified random sample A method of sampling obtained by (a) dividing the population into subgroups based on one or more variables central to our analysis and (b) then drawing a simple random sample from each of the subgroups.
Proportionate stratified sample The size of the sample selected from each subgroup is proportional to the size of that subgroup in the entire population.
Disproportionate stratified sample The size of the sample selected from each subgroup is disproportional to the size of the subgroup in the population.
For our example, the variable that defines our subgroups is race and ethnicity. We could divide the population into different racial/ethnic groups and then draw a simple random sample from each group. In a proportionate stratified sample, the size of the sample selected from each subgroup is proportional to the size of that subgroup in the entire population. For a sample of 180 individuals, we would select 126 whites (70%), 36 blacks (20%), and 18 Latinos (10%). Proportional sampling ensures the representation of the subgroup variable.
In a disproportionate stratified sample, the size of the sample selected from each subgroup is deliberately made disproportional to the size of that subgroup in the population. For instance, for our example, we could select a sample (N = 180) consisting of 90 whites (50%), 45 blacks (25%), and 45 Latinos (25%). In such a sampling design, although the sampling probabilities for each population member are not equal (they vary between groups), they are known, and therefore, we can make accurate estimates of error in the inference process. Disproportionate stratified sampling is especially useful when we want to compare subgroups with each other, and when the size of some of the subgroups in the population is relatively small. Proportionate sampling can result in the sample having too few members from a small subgroup to yield reliable information about them.
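The two allocations can be compared in a short Python sketch. The function names `proportionate_allocation` and `inclusion_probabilities` are our own. The simple rounding used here works because the example numbers divide evenly, but in general rounded allocations may not sum exactly to the desired sample size.

```python
def proportionate_allocation(strata_sizes, sample_size):
    """Give each stratum a share of the sample equal to its share of the population."""
    total = sum(strata_sizes.values())
    return {group: round(sample_size * size / total)
            for group, size in strata_sizes.items()}


def inclusion_probabilities(strata_sizes, allocation):
    """Chance that any one member of a stratum ends up in the sample."""
    return {group: allocation[group] / strata_sizes[group] for group in strata_sizes}


population = {"white": 700, "black": 200, "latino": 100}

proportionate = proportionate_allocation(population, 180)
disproportionate = {"white": 90, "black": 45, "latino": 45}  # allocation from the text

print(proportionate)
print(inclusion_probabilities(population, proportionate))
print(inclusion_probabilities(population, disproportionate))
```

Under the proportionate design, every member of the population has the same chance of selection (180 out of 1,000). Under the disproportionate design, the chances differ by group but remain known, which is what allows accurate estimates of error.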
Learning Check 6.4
Can you think of some research questions that could best be studied using a disproportionate stratified random sample? When might it be important to use a proportionate stratified random sample?
We began this chapter with a few examples illustrating why researchers in the social sciences almost never collect information on the entire population that interests them. Instead, they usually select a sample from that population and use the principles of statistical inference to estimate the characteristics, or parameters, of that population based on the characteristics, or statistics, of the sample. In this section, we describe one of the most important concepts in statistical inference—sampling distribution. The sampling distribution helps estimate the likelihood of our sample statistics and, therefore, enables us to generalize from the sample to the population.
To illustrate the concept of the sampling distribution, let’s consider as our population the 20 individuals listed in Table 6.3. Our variable, Y, is the income (in dollars) of these 20 individuals, and the parameter we are trying to estimate is the mean income.
A Closer Look 6.1
Disproportionate stratified sampling is especially useful given the increasing diversity of American society. In a diverse society, factors such as race, ethnicity, class, and gender, as well as other categories of experience such as age, religion, and sexual orientation become central in shaping our experiences and defining the differences among us. These factors are an important dimension of the social structure, and they not only operate independently but also are experienced simultaneously by all of us. For example, if you are a white woman, you may share some common experiences with a woman of color based on your gender, but your racial experiences are going to be different. Moreover, your experiences within the race/gender system are further shaped by your social class. Similarly, if you are a man, your experiences are shaped as much by your class, race, and sexual orientation as they are by your gender. If you are a black gay man, for instance, you might not benefit equally from patriarchy compared with a classmate who is a white heterosexual male.
What are the research implications of an inclusive approach that emphasizes social differences? Such an approach will include women and men in a study of race, Latinos and people of color when considering class, and women and men of color when studying gender. Such an approach makes the experience of previously excluded groups more visible and central because it puts those who have been excluded at the center of the analysis so that we can better understand the experience of all groups, including those with privilege and power.
What are the sampling implications of such an approach? Suppose you are looking at the labor force experiences of black women and Latinas who are above 50 years of age, and you want to compare these experiences with those of white women in the same age group. Both Latinas and black women comprise a small proportion of the population. A proportional sample probably would not include enough Latinas or black women to provide an adequate basis for comparison with white women. To make such comparisons, it would be desirable to draw a disproportionate stratified sample that deliberately overrepresents both Latinas and black women so that these subsamples will be of sufficient size (Figure 6.3).
Table 6.3 The Population: Personal Income (in Dollars) for 20 Individuals (Hypothetical Data)
Mean (μ) = 22,766
Standard deviation (σ) = 14,687
We use the symbol μ to represent the population mean. Using Formula 3.1, we can calculate the population mean by summing the 20 incomes and dividing by the number of individuals: μ = ΣY/N = 22,766.
Using Formula 4.3, we can also calculate the standard deviation for this population distribution. We use the Greek symbol sigma (σ) to represent the population’s standard deviation: σ = 14,687.
Of course, most of the time, we do not have access to the population. So instead, we draw one sample, compute the mean—the statistic—for that sample, and use it to estimate the population mean—the parameter.
Let’s assume that μ is unknown and that we estimate its value by drawing a random sample of three individuals (N = 3) from the population of 20 individuals and calculate the mean income for that sample. The incomes included in that sample are as follows:
Now let’s calculate the mean for that sample:
Note that our sample mean, Ȳ = $20,817, differs from the actual population parameter, $22,766. This discrepancy is due to sampling error. Sampling error is the discrepancy between a sample estimate of a population parameter and the real population parameter. By comparing the sample statistic with the population parameter, we can determine the sampling error. The sampling error for our example is 1,949 (22,766 − 20,817 = 1,949).
Sampling error The discrepancy between a sample estimate of a population parameter and the real population parameter.
Now let’s select another random sample of three individuals. This time, the incomes included are as follows:
The mean for this sample is
The sampling error for this sample is 1,421 (22,766 − 21,345 = 1,421), somewhat less than the error for the first sample we selected.
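The two sampling errors can be verified with a few lines of Python. The variable names are our own. The figures are the population mean and the two sample means reported above.

```python
population_mean = 22_766   # the parameter, known here only because the data are hypothetical

sample_means = {"first sample": 20_817, "second sample": 21_345}

# Sampling error: the population parameter minus the sample estimate.
sampling_errors = {label: population_mean - sample_mean
                   for label, sample_mean in sample_means.items()}

for label, error in sampling_errors.items():
    print(label, error)
```

Both samples underestimate the population mean, the second by a smaller margin than the first.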
Although comparing the sample estimates of the average income with the actual population average is a perfect way to evaluate the accuracy of our estimate, in practice, we rarely have information about the actual population parameter. If we did, we would not need to conduct a study! Moreover, few, if any, sample estimates correspond exactly to the actual population parameter. This, then, is our dilemma: If sample estimates vary and if most estimates result in some sort of sampling error, how much confidence can we place in the estimate? On what basis can we infer from the sample to the population?
The answer to this dilemma is to use a device known as the sampling distribution. The sampling distribution is a theoretical probability distribution of all possible sample values for the statistic in which we are interested. If we were to draw all possible random samples of the same size from our population of interest, compute the statistic for each sample, and plot the frequency distribution for that statistic, we would obtain an approximation of the sampling distribution. Every statistic (for example, a proportion, a mean, or a variance) has a sampling distribution. Because it includes all possible sample values, the sampling distribution enables us to compare our sample result with other sample values and determine the likelihood associated with that result.
Sampling distribution A theoretical probability distribution of all possible sample values for the statistic in which we are interested.
Sampling distributions are theoretical distributions, which means that they are never really observed. Constructing an actual sampling distribution would involve taking all possible random samples of a fixed size from the population. This process would be very tedious because it would involve a very large number of samples. However, to help grasp the concept of the sampling distribution, let’s illustrate how one could be generated from a limited number of samples.
For our illustration, we use one of the most common sampling distributions—the sampling distribution of the mean. The sampling distribution of the mean is a theoretical distribution of sample means that would be obtained by drawing from the population all possible samples of the same size.
Sampling distribution of the mean A theoretical probability distribution of sample means that would be obtained by drawing from the population all possible samples of the same size.
Let’s go back to our example in which our population is made up of 20 individuals and their incomes. From that population (Table 6.3), we now randomly draw 50 possible samples of size 3 (N = 3), computing the mean income for each sample and replacing it before drawing another.
In our first sample of N = 3, we draw three incomes: $8,451, $41,654, and $18,923. The mean income for this sample is
Ȳ = (8,451 + 41,654 + 18,923)/3 = 69,028/3 ≈ $23,009
Now we restore these individuals to the original list and select a second sample of three individuals. The mean income for this sample is
We repeat this process 48 more times, each time computing the sample mean and restoring the sample to the original list. Table 6.4 lists the means of the first five and the 50th samples of N = 3 that were drawn from the population of 20 individuals. (Note that ΣȲ refers to the sum of all the means computed for each of the samples and M refers to the total number of samples that were drawn.)
The grouped frequency distribution for all 50 sample means (M = 50) is displayed in Table 6.5; Figure 6.4 is a histogram of this distribution. This distribution is an example of a sampling distribution of the mean. Note that in its structure, the sampling distribution resembles a frequency distribution of raw scores, except that here each score is a sample mean, and the corresponding frequencies are the number of samples with that particular mean value. For example, the third interval in Table 6.5 ranges from $19,500 to $23,500, with a corresponding frequency of 14, or 28%. This means that we drew 14 samples (28%) with means ranging between $19,500 and $23,500.
Table 6.4 Mean Income of 50 Samples of Size 3
Total (M) = 50
Remember that the distribution depicted in Table 6.5 and Figure 6.4 is an empirical distribution, whereas the sampling distribution is a theoretical distribution. In reality, we never really construct a sampling distribution. However, even this simple empirical example serves to illustrate some of the most important characteristics of the sampling distribution.
Table 6.5 Sampling Distribution of Sample Means for Sample Size N = 3 Drawn From the Population of 20 Individuals’ Incomes
Sample Mean Intervals
Before we continue, let’s take a moment to review the three distinct types of distribution.
The Population: We began with the population distribution of 20 individuals. This distribution actually exists. It is an empirical distribution that is usually unknown to us. We are interested in estimating the mean income for this population.
The Sample: We drew a sample from that population. The sample distribution is an empirical distribution that is known to us and is used to help us estimate the mean of the population. We selected 50 samples of N = 3 and calculated the mean income. We generally use the sample mean (Ȳ) as an estimate of the population mean (μ).
The Sampling Distribution of the Mean: For illustration, we generated an approximation of the sampling distribution of the mean, consisting of 50 samples of N = 3. The sampling distribution of the mean does not really exist. It is a theoretical distribution.
To help you understand the relationship among the population, the sample, and the sampling distribution, we have illustrated in Figure 6.5 the process of generating an empirical sampling distribution of the mean. From a population of raw scores (Ys), we draw M samples of size N and calculate the mean of each sample. The resulting sampling distribution of the mean, based on M samples of size N, shows the values that the mean could take and the frequency (number of samples) associated with each value. Make sure you understand these relationships. The concept of the sampling distribution is crucial to understanding statistical inference. In this and the following chapter, we learn how to employ the sampling distribution to draw inferences about the population on the basis of sample statistics.
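The process of drawing M samples of size N and recording each mean can be sketched in Python. Note the assumptions: the 20 incomes below are invented for this illustration and are NOT the values from Table 6.3, and the function name `empirical_sampling_distribution` and the seed are hypothetical choices. The sketch shows the procedure only. It does not reproduce the numbers in Tables 6.4 and 6.5.

```python
import random
from statistics import mean


def empirical_sampling_distribution(population, sample_size, num_samples, seed=None):
    """Draw num_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    # Each sample is drawn from the full list, so earlier samples are "restored".
    return [mean(rng.sample(population, sample_size)) for _ in range(num_samples)]


# Invented incomes for 20 individuals (illustrative only, not Table 6.3).
incomes = [6_200, 8_400, 9_700, 11_300, 12_500, 13_900, 15_200, 16_800,
           18_100, 19_600, 21_000, 22_700, 24_300, 26_500, 29_000, 32_400,
           36_800, 42_100, 51_500, 68_000]

sample_means = empirical_sampling_distribution(incomes, sample_size=3,
                                               num_samples=50, seed=6)

print(len(sample_means))          # M, the number of sample means
print(round(mean(incomes)))       # the population mean
print(round(mean(sample_means)))  # the mean of the M sample means
```

Changing the seed produces a different set of 50 means, which is a reminder that any such list is an empirical approximation of the theoretical sampling distribution.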
Like the population and sample distributions, the sampling distribution can be described in terms of its mean and standard deviation. We use the symbol μ_Ȳ to represent the mean of the sampling distribution. The subscript indicates the specific variable of this sampling distribution. To obtain the mean of the sampling distribution, add all the individual sample means (ΣȲ = 1,237,482) and divide by the number of samples (M = 50). Thus, the mean of the sampling distribution of the mean is actually the mean of means:
μ_Ȳ = ΣȲ/M = 1,237,482/50 ≈ 24,750
The standard deviation of the sampling distribution is also called the standard error of the mean. The standard error of the mean describes how much dispersion there is in the sampling distribution, or how much variability there is in the value of the mean from sample to sample:
σ_Ȳ = σ/√N
Standard error of the mean The standard deviation of the sampling distribution of the mean. It describes how much dispersion there is in the sampling distribution of the mean.
This formula tells us that the standard error of the mean is equal to the standard deviation of the population σ divided by the square root of the sample size (N). For our example, because the population standard deviation is 14,687 and our sample size is 3, the standard error of the mean is
σ_Ȳ = 14,687/√3 ≈ 8,480
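Both summary measures of the sampling distribution can be recomputed from the figures given in the text. The sketch below uses our own variable names and only Python's standard `math` module.

```python
import math

# Mean of the sampling distribution: the sum of the sample means divided by M.
sum_of_sample_means = 1_237_482
num_samples = 50
mean_of_means = sum_of_sample_means / num_samples

# Standard error of the mean: population standard deviation over the square root of N.
population_sd = 14_687
sample_size = 3
standard_error = population_sd / math.sqrt(sample_size)

print(round(mean_of_means))                      # rounded to the nearest dollar
print(round(standard_error))                     # rounded to the nearest dollar
print(round(standard_error / population_sd, 2))  # standard error relative to sigma
```

The last line shows that, for samples of size 3, the standard error is the population standard deviation multiplied by 1/√3, or a little more than half of it.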
In Figure 6.6a and b, we compare the histograms for the population and sampling distributions of Tables 6.3 and 6.4. Figure 6.6a shows the population distribution of 20 incomes, with a mean μ = 22,766 and a standard deviation σ = 14,687. Figure 6.6b shows the sampling distribution of the means from 50 samples of N = 3 with a mean μ_Ȳ = 24,750 and a standard deviation (the standard error of the mean) σ_Ȳ = 8,480. These two figures illustrate some of the basic properties of sampling distributions in general and the sampling distribution of the mean in particular.
First, as can be seen from Figure 6.6a and b, the shapes of the two distributions differ considerably. Whereas the population distribution is skewed to the right, the sampling distribution of the mean is less skewed—that is, it is closer to a symmetrical, normal distribution.
Second, whereas only a few of the sample means coincide exactly with the population mean, $22,766, the sampling distribution centers on this value. The mean of the sampling distribution is a pretty good approximation of the population mean.
In the discussions that follow, we make frequent references to the mean and standard deviation of the three distributions. To distinguish among the different distributions, we use certain conventional symbols to refer to the means and standard deviations of the sample, the population, and the sampling distribution. Note that we use Greek letters to refer to both the sampling and the population distributions.
The Population: We began with the population distribution of 20 individuals. This distribution actually exists. It is an empirical distribution that is usually unknown to us. We are interested in estimating the mean income for this population.
The Sample: We drew a sample from that population. The sample distribution is an empirical distribution that is known to us and is used to help us estimate the mean of the population. We selected 50 samples of N = 3 and calculated the mean income. We mostly use the sample mean as an estimate of the population mean μ.
The Sampling Distribution of the Mean: For illustration, we generated an approximation of the sampling distribution of the mean, consisting of 50 samples of N = 3. The sampling distribution of the mean does not really exist. It is a theoretical distribution.
Third, the variability of the sampling distribution is considerably smaller than the variability of the population distribution. Note that the standard deviation for the sampling distribution (σῩ = 8,480) is a little more than half that for the population (σ = 14,687).
These properties of the sampling distribution are even more striking as the sample size increases. To illustrate the effect of a larger sample on the shape and properties of the sampling distribution, we went back to our population of 20 individual incomes and drew 50 additional samples of N = 6. We calculated the mean for each sample and constructed another sampling distribution. This sampling distribution is shown in Figure 6.6c. It has a mean μῩ = 24,064 and a standard deviation σῩ = 5,995. Note that as the sample size increased, the sampling distribution became more compact. This decrease in the variability of the sampling distribution is reflected in a smaller standard deviation: With an increase in sample size from N = 3 to N = 6, the standard deviation of the sampling distribution decreased from 8,480 to 5,995. Furthermore, with a larger sample size, the sampling distribution of the mean is an even better approximation of the normal curve.
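The shrinking spread can also be seen in a small simulation. The sketch below uses a hypothetical population of 20 incomes (not the chapter's data) and draws each sample with replacement, which is the setting in which σ/√N applies exactly. The number of samples (20,000) and the extra sample size N = 24 are our own choices.

```python
import random
from math import sqrt
from statistics import pstdev

def spread_of_sample_means(population, n, m, rng):
    """Standard deviation of m sample means, each from a sample of size n."""
    # rng.choices samples WITH replacement
    means = [sum(rng.choices(population, k=n)) / n for _ in range(m)]
    return pstdev(means)

rng = random.Random(42)

# Hypothetical right-skewed population of 20 incomes (NOT the values in Table 6.3)
population = [6000, 8500, 9800, 11200, 12500, 13900, 15000, 16400, 17800, 19000,
              20500, 22000, 24000, 26500, 29000, 32000, 36000, 41000, 52000, 68000]
sigma = pstdev(population)

results = {}
for n in (3, 6, 24):
    results[n] = spread_of_sample_means(population, n, m=20_000, rng=rng)
    print(f"N={n:>2}  simulated SE={results[n]:>8,.0f}  "
          f"sigma/sqrt(N)={sigma / sqrt(n):>8,.0f}")
```

With this many samples, the simulated standard deviations should land close to σ/√N and fall steadily as N grows.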
These properties of the sampling distribution of the mean are summarized more systematically in one of the most important statistical principles underlying statistical inference. It is called the central limit theorem, and it states that if all possible random samples of size N are drawn from a population with a mean μ and a standard deviation σ, then as N becomes larger, the sampling distribution of sample means becomes approximately normal, with mean μῩ equal to the population mean and a standard deviation equal to σῩ = σ/√N.
The significance of the central limit theorem is that it tells us that with a sufficient sample size the sampling distribution of the mean will be normal regardless of the shape of the population distribution. Therefore, even when the population distribution is skewed, we can still assume that the sampling distribution of the mean is normal, given random samples of large enough size. Furthermore, the central limit theorem also assures us that (a) as the sample size gets larger, the mean of the sampling distribution becomes equal to the population mean and (b) as the sample size gets larger, the standard error of the mean (the standard deviation of the sampling distribution of the mean) decreases in size. The standard error of the mean tells how much variability in the sample estimates there is from sample to sample. The smaller the standard error of the mean, the closer (on average) the sample means will be to the population mean. Thus, the larger the sample, the more closely the sample statistic clusters around the population parameter.
Central limit theorem: If all possible random samples of size N are drawn from a population with a mean μ and a standard deviation σ, then as N becomes larger, the sampling distribution of sample means becomes approximately normal, with mean μῩ equal to the population mean and a standard deviation equal to σῩ = σ/√N.
Although there is no hard-and-fast rule, a general rule of thumb is that when N is 50 or more, the sampling distribution of the mean will be approximately normal regardless of the shape of the population distribution. However, we can assume that the sampling distribution will be normal even with samples as small as 30 if we know that the population distribution approximates normality.
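To see the central limit theorem at work, the following Python sketch draws sample means from a strongly right-skewed population and measures how skewed the resulting sampling distributions are. Everything here is an assumption made for illustration: the exponential population, the sample sizes, the 5,000 samples per size, and the use of skewness as a rough gauge of how far a distribution is from a symmetrical, normal shape.

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Average cubed Z score: about 0 for a symmetrical distribution."""
    mu, sd = mean(values), pstdev(values)
    return sum(((v - mu) / sd) ** 3 for v in values) / len(values)

rng = random.Random(7)

# Hypothetical, strongly right-skewed population (exponential with mean 1)
population = [rng.expovariate(1.0) for _ in range(100_000)]

skew_by_n = {}
for n in (2, 10, 50):
    # 5,000 random samples of size n, drawn with replacement
    sample_means = [sum(rng.choices(population, k=n)) / n for _ in range(5_000)]
    skew_by_n[n] = skewness(sample_means)
    print(f"N={n:>2}  skewness of sample means = {skew_by_n[n]:.2f}")
```

The skewness of the sample means should move toward 0 as N increases, even though the population itself stays just as skewed.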
Learning Check 6.5
What is a normal population distribution? If you can’t answer this question, go back to Chapter 5. You must understand the concept of a normal distribution before you can understand the techniques involved in inferential statistics.
In the preceding sections, we have covered a lot of abstract material. You may have a number of questions at this time. Why is the concept of the sampling distribution so important? What is the significance of the central limit theorem? To answer these questions, let’s go back and review our 20 incomes example.
To estimate the mean income of a population of 20 individuals, we drew a sample of three cases and calculated the mean income for that sample. Our first sample mean, Ῡ = 20,817, differs from the actual population parameter, μ = 22,766. When we selected different samples, we found each time that the sample mean differed from the population mean. These discrepancies are due to sampling errors. Had we taken a number of additional samples, we probably would have found that the mean was different each time because every sample differs slightly. Few, if any, sample means would correspond exactly to the actual population mean. Usually we have only one sample statistic as our best estimate of the population parameter.
So now let’s restate our dilemma: If sample estimates vary and if most result in some sort of sampling error, how much confidence can we place in the estimate? On what basis can we infer from the sample to the population?
The solution lies in the sampling distribution and its properties. Because the sampling distribution is a theoretical distribution that includes all possible sample outcomes, we can compare our sample outcome with it and estimate the likelihood of its occurrence.
Since the sampling distribution is theoretical, how can we know its shape and properties so that we can make these comparisons? Our knowledge is based on what the central limit theorem tells us about the properties of the sampling distribution of the mean. We know that if our sample size is large enough (at least 50 cases), most sample means will be quite close to the true population mean. It is highly unlikely that our sample mean would deviate much from the actual population mean.
In Chapter 5, we saw that in all normal curves, a constant proportion of the area under the curve lies between the mean and any given distance from the mean when measured in standard deviation units, or Z scores. We can find this proportion in the standard normal table (Appendix B).
Knowing that the sampling distribution of the means is approximately normal, with a mean μῩ and a standard deviation σ/√N (the standard error of the mean), we can use Appendix B to determine the probability that a sample mean will fall within a certain distance (measured in standard deviation units, or Z scores) of μῩ or μ. For example, we can expect approximately 68% (or we can say the probability is approximately 0.68) of all sample means to fall within ±1 standard error (σ/√N), that is, within one standard deviation of the sampling distribution of the mean, of μῩ or μ. Similarly, the probability is about 0.95 that the sample mean will fall within ±2 standard errors of μῩ or μ. In the next chapter, we will see how this information helps us evaluate the accuracy of our sample estimates.
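The proportions quoted above can be reproduced without the printed table. The sketch below uses Python's standard library normal distribution in place of Appendix B. The helper name prob_within is our own.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal curve: mean 0, standard deviation 1

def prob_within(k):
    """Probability that a sample mean falls within +/- k standard errors of mu."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

print(round(prob_within(1), 4))  # 0.6827
print(round(prob_within(2), 4))  # 0.9545
```

These probabilities hold only to the extent that the sampling distribution of the mean is actually close to normal.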
Learning Check 6.6
Suppose a population distribution has a mean μ = 150 and a standard deviation σ = 30, and you draw a simple random sample of N = 100 cases. What is the probability that the sample mean is between 147 and 153? What is the probability that the sample mean exceeds 153? Would you be surprised to find a mean score of 159? Why? (Hint: To answer these questions, you need to apply what you learned in Chapter 5 about Z scores and areas under the normal curve [Appendix B].) To translate a raw score into a Z score we used this formula: Z = (Y − Ῡ)/σ
However, because here we are dealing with a sampling distribution, replace Y with the sample mean Ῡ, Ῡ with the sampling distribution’s mean μῩ, and σ with the standard error of the mean: Z = (Ῡ − μῩ)/(σ/√N)
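One way to check your work on this Learning Check is the Python sketch below. It assumes the sampling distribution is normal and treats the stated standard deviation of 30 as the population value σ. The function name is our own, and the standard library normal distribution stands in for Appendix B.

```python
from math import sqrt
from statistics import NormalDist

def z_for_sample_mean(y_bar, mu, sigma, n):
    """Z score of a sample mean: its distance from mu in standard errors."""
    standard_error = sigma / sqrt(n)
    return (y_bar - mu) / standard_error

mu, sigma, n = 150, 30, 100      # values from the Learning Check
std_normal = NormalDist()        # stands in for the table in Appendix B

z_low = z_for_sample_mean(147, mu, sigma, n)    # -1.0
z_high = z_for_sample_mean(153, mu, sigma, n)   # 1.0
p_between = std_normal.cdf(z_high) - std_normal.cdf(z_low)
p_above = 1 - std_normal.cdf(z_high)
z_159 = z_for_sample_mean(159, mu, sigma, n)    # 3.0

print(f"P(147 < sample mean < 153) = {p_between:.4f}")
print(f"P(sample mean > 153)       = {p_above:.4f}")
print(f"Z for a sample mean of 159 = {z_159:.1f}")
```

A sample mean of 159 lies three standard errors above μ, which is why it would be a surprising result for a sample of this size.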
There are numerous applications of the central limit theorem in research, business, medicine, and popular media. As varied as these applications may be, what they have in common is that the data are derived from relatively small random samples taken from considerably larger and often varied populations. And the data have consequence—informing our understanding of the social world, influencing decisions, shaping social policy, and predicting social behavior.
On November 6, 2012, Barack Obama was reelected president of the United States with 51% of the vote. Governor Mitt Romney, the Republican candidate, received 47% of the votes. Weeks before the election took place, several surveys correctly predicted an Obama victory within 2 or 3 percentage points of the actual result. These predictions were based on interviews conducted with samples no larger than about 2,000 registered voters. What is astounding about these surveys is that their predictions were based on a single, small sample of the voting population.
But not all election polls predicted an Obama victory. Romney and his campaign staff believed that he would win the election because the campaign’s internal polling showed Romney leading in several key swing states. Days before the election, their survey results indicated that Romney was 2.5 points ahead of Obama in Colorado. In the end, the Republican candidate lost the state by 5.4 points. After the election, the campaign’s chief pollster, Neil Newhouse, admitted that the biggest flaw of their polling was “the failure to predict the demographic composition of the electorate.” As described by Noam Scheiber (2012), “The people who showed up to vote on November 6 were younger and less white than Team Romney anticipated, and far more Democratic as a result.”8
Romney campaign pollsters were not the only ones who erred. Throughout the presidential campaign, the Gallup Poll consistently reported a lead by the Republican candidate. The day before the election, Gallup’s final preelection survey gave Romney a one-point lead over Obama, 49% versus 48%. After a postelection assessment, Gallup identified four factors that contributed to the difference between its estimates and the final election results. Though part of Gallup’s prediction problem involved the mathematical weighting of responses (not the focus of our discussion), Gallup admitted to serious sampling missteps. The organization reported that it (a) overestimated the number of voters most likely to vote for Romney, (b) completed more interviews in pro-Romney geographic regions, (c) sampled too many white voters while undersampling Hispanic and black voters, and (d) relied on a listed landline sample that included older and more Republican voters.9
Data at Work
As an undergraduate, Dr. Emily Treichler wanted to figure out a way to make mental health treatment more effective and more accessible. Having completed her Ph.D., she is currently a postdoctoral fellow conducting research on schizophrenia and other related disorders in a VA hospital. “I divide my time between research, clinical work, and other kinds of training, including learning new methods. I conduct research in clinical settings working with people who are experiencing mental health problems, and use the results of my research and other research literature to try to improve mental health treatment.”
“I use statistics and methods constantly. I read research literature in order to learn more about my area, to apply in clinical situations, and to apply it to my own research. I conduct clinical research, using both qualitative and quantitative methodology, and conducting statistics on quantitative data. I collect data in settings I work as a clinician and conduct statistics on that data in order to understand how our services are working, and how to improve clinical services.”
According to Treichler, “Quantitative research can be an incredibly fun area.” For students interested in the field, she advises, “Get a wide range of training in statistics and methods so you can understand the literature in your area, and have access to multiple methods for your own studies. Choosing appropriate methods and statistics given your research question and the literature in your area is key to creating a successful project.”
The presidential polling failures of the Romney campaign and Gallup help underscore the value of representative samples. Since most social science research is based on sample data, it is imperative that our samples accurately reflect the populations we’re interested in. If not, we won’t be able to make appropriate and meaningful inferences about the population, the primary goal of inferential statistics.
Gallup implemented new sampling and calling procedures to improve its polling. So it was a surprise when Gallup Editor-in-Chief Frank Newport announced in late 2015 that the organization would not conduct any polling for the 2016 presidential primary or general election. According to Newport, Gallup would still conduct polls about broader social and political issues.
SPSS Demonstration [GSS14SSDS-B]
In this chapter, we’ve discussed various types of samples and the definition of the standard error of the mean. Usually, data entered into SPSS have already been sampled from some larger population. However, SPSS does have a sampling procedure that can take random samples of data. Systematic samples and stratified samples can also be drawn with SPSS, but they require the use of the SPSS command language.
When might it be worthwhile to use the SPSS Sample procedure? One instance is when doing preliminary analysis of a very large data set. For example, if you worked for your local hospital and had complete data records for all patients (tens of thousands), there would be no need to use all the data during initial analysis. You could select a random sample of individuals and use the subset of data for preliminary analysis. Later, the complete patient data set could be used for completing your final analyses.
To use the Sample procedure, click on Data from the main menu, then click on Select Cases. The opening dialog box (Figure 6.7) has five choices that will select a subset of cases via various methods. By default, the All cases button is checked. We click on the Random sample of cases button, then on the Sample button to give SPSS our specification.
The next dialog box (Figure 6.8) provides two options to create a random sample. The most convenient one is the first, where we tell SPSS what percentage of cases to select from the larger file. Alternatively, we can tell SPSS to take an exact number of cases. The second option is available because SPSS will only take approximately the percentage specified in the first option.
We type “10” in the box to ask for 10% of the original sample of 1,500 respondents from the GSS. Then, click on Continue and OK, as usual, to process the request.
SPSS does not delete the cases from the active data file that aren’t selected for the sample. Instead, they are filtered out (you can identify them in the Data View window by the slash across their row number). This means that we can always return to the full data file by going back to the Select Cases dialog box and selecting the All cases button.
When SPSS processes our request, it tells us that the data have been filtered by putting the words “Filter On” in the status area at the bottom of the SPSS window (the status area has many helpful messages from SPSS).
To demonstrate the effect of sampling, we ask for univariate statistics for the variable HRS1, measuring the number of hours a respondent worked last week. Click on Analyze, Descriptive Statistics, and then Descriptives to open this dialog box. Place HRS1 in the variable list. Click on the Options button to select the mean, standard deviation, minimum, and maximum values. In addition, we’ll add the standard error of the mean by clicking the S.E. mean box. Then, click Continue and OK to put SPSS to work.
The results (Figure 6.9) show that the number of valid cases is 87, or roughly 10% of the 895 valid cases (those who responded to the number of hours worked last week). The mean of HRS1 is 42.38, and the standard error of the mean is 1.78. If we repeat the process, this time asking for a 25% sample, we obtain the results shown in Figure 6.10.
Your results may differ from the results presented here. We are asking SPSS to generate a random selection of cases, and you may not get the same selection of cases as we did.
How closely does the mean for HRS1 from these two random samples match that of the full file? The mean for all 895 respondents (the other 605 respondents did not have valid responses) is 41.47 hours. Both samples produced means and standard deviations that are close to the values for the full file.
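For readers who want to try the same idea outside SPSS, the Python sketch below mimics the demonstration: it takes 10% and 25% random samples and reports the mean, standard deviation, and standard error of the mean, estimated as s/√N. It is not SPSS output. The hours values are a hypothetical stand-in generated on the spot (not the GSS data), and the names describe and hours are our own.

```python
import random
from math import sqrt
from statistics import mean, stdev

def describe(values):
    """N, mean, standard deviation, and standard error of the mean (s / sqrt(N))."""
    n = len(values)
    s = stdev(values)
    return {"N": n, "mean": mean(values), "sd": s, "se_mean": s / sqrt(n)}

rng = random.Random(2014)

# Hypothetical stand-in for 895 valid HRS1 responses (NOT the GSS data)
hours = [min(89, max(1, round(rng.gauss(41.5, 15)))) for _ in range(895)]

summaries = {"full file": describe(hours)}
for percent in (10, 25):
    k = len(hours) * percent // 100          # exact number of cases to select
    subset = rng.sample(hours, k)            # simple random sample, no repeats
    summaries[f"{percent}% sample"] = describe(subset)

for label, d in summaries.items():
    print(f"{label:<11} N={d['N']:>3}  mean={d['mean']:.2f}  "
          f"sd={d['sd']:.2f}  SE mean={d['se_mean']:.2f}")
```

Unlike the percentage option in SPSS, which selects only approximately the requested share, this sketch selects an exact number of cases.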
SPSS Problems [GSS14SSDS-B]
Using GSS14SSDS-B, repeat the SPSS demonstration, selecting 25%, 50%, 75%, and 100% samples and requesting descriptives for MAEDUC and PAEDUC. Compare your descriptive statistics with descriptives for the entire sample. What can you say about the accuracy of your random samples?