STAT701 – Assignment 2. 2
QUESTION 1 (20 marks) The data set States.csv contains a number of education indices
and related statistics for the 50 U.S. states plus Washington, D.C. The file therefore
has 52 rows (including the column names in the first row) and the following 8
columns:
State
U.S. State code
Region
U.S. Census regions. A factor with levels: ENC, East North Central; ESC, East South
Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA,
South Atlantic; WNC, West North Central; WSC, West South Central.
pop
Population: in 1,000s.
SATV
Average score of graduating high-school students in the state on the verbal component
of the Scholastic Aptitude Test (a standard university admission exam).
SATM
Average score of graduating high-school students in the state on the math component
of the Scholastic Aptitude Test.
percent
Percentage of graduating high-school students in the state who took the SAT exam.
dollars
State spending on public education, in $1000s per student.
pay
Average teacher’s salary in the state, in $1000s.
The aim of this question is to rank the 50 states and Washington, D.C. according to
the above indices (all columns except State and Region) using a principal component analysis.
Recall that a principal component analysis aims to transform the original variables into a smaller
number of principal components that account for most of the variance of these variables.
The plots of the first few components (usually the first two) can reveal useful information
about the distribution of the data, such as identifying different groups of the data or
identifying observations with extreme values (possible outliers).
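As a rough illustration of the idea recalled above, the following Python sketch standardizes a small data set, forms its correlation matrix, and extracts the first principal component by power iteration. It is only a conceptual aid and does not replace the SAS code the question asks for. The data values and the function names (standardize, correlation_matrix, first_component) are made up for this example and have nothing to do with States.csv.

```python
import math

# Made-up toy data (NOT from States.csv): 5 observations on 3 variables.
rows = [
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.1, 0.7],
    [6.0, 12.0, 1.2],
]


def standardize(data):
    """Centre each column at 0 and scale it to unit sample variance."""
    n, p = len(data), len(data[0])
    means = [sum(r[j] for r in data) / n for j in range(p)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in data) / (n - 1))
        for j in range(p)
    ]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in data]


def correlation_matrix(z):
    """Correlation matrix computed from already-standardized columns."""
    n, p = len(z), len(z[0])
    return [
        [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(corr, iterations=500):
    """Leading eigenvalue and unit eigenvector of corr by power iteration."""
    p = len(corr)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, v


z = standardize(rows)
corr = correlation_matrix(z)
eigenvalue, loadings = first_component(corr)
print("correlation matrix:", corr)
print("first eigenvalue:", eigenvalue)
print("first eigenvector (loadings):", loadings)
```

In this toy data the first two variables are almost perfectly correlated, so they receive loadings of the same sign on the first component. This is the kind of link between correlations and components that parts c), d) and f) ask you to reason about.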
*SAS code must be provided as part of your answer in a) – c) and e) below*
a) (1 mark code + 1 mark explanation) Read the data into SAS. Display and explain
the output from PROC CONTENTS. Make sure that State and Region are read in
as character variables.
b) (1 mark code + 1 mark output) Using PROC STANDARD, standardize the numeric
variables to have mean zero and unit variance. Show the first 10 observations.
c) (1 mark code + 2 marks identification) Compute the correlation matrix of the numeric
variables only. Identify and write down the set of variables (if any) that can be termed
as moderately-to-highly correlated.
d) (2 marks) In part e) below you will carry out a principal component analysis.
Based on your answer in c), which indices would you expect to be associated with
PC1 and PC2? Justify your answer.
e) (1 mark code) Use PROC PRINCOMP to run a principal component analysis in SAS
using the correlation matrix. Provide only your SAS code, which will be taken as
the answer to this part. No output is required.
f) (2 marks) By looking at the matrix of eigenvectors, which variables/indices appear
to have the largest influence on the first two PCs? Justify your answer.
g) (2 marks) For simplicity, the first two PCs are commonly retained so that the
data can be plotted against the PCs in the plane (R^2). By analyzing the scree plot and
the plot of cumulative variance explained, would you recommend retaining the first two
PCs, both to reveal useful information about the 50 states + DC and to represent the
indices adequately? Justify your answer.
h) (3 marks i. + 3 marks ii.) Plot the first two components using the option plots(ncomp
= 2) = all. Based on your answer in f) and this plot: i. interpret both components in
terms of the ‘represented’ underlying attribute, and ii. identify and relate U.S. states
to each component (Hint: Get a scatterplot PC1 vs PC2 for the 51 observations).
Interpretation must be provided in terms of the problem.
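The quantities behind the scree plot and the cumulative-variance plot in part g) can be sketched as follows. This is a minimal Python illustration under an explicit assumption: the eigenvalues below are hypothetical, chosen only so that they sum to 6 (the number of standardized variables), and they are not the eigenvalues of the States data.

```python
# Hypothetical eigenvalues for six standardized variables (illustration only;
# they are NOT the eigenvalues of the States data). They sum to 6.
eigenvalues = [3.0, 1.5, 0.8, 0.4, 0.2, 0.1]


def variance_explained(values):
    """Return (proportions, cumulative proportions), one entry per component."""
    total = sum(values)
    proportions = [v / total for v in values]
    cumulative = []
    running = 0.0
    for prop in proportions:
        running += prop
        cumulative.append(running)
    return proportions, cumulative


proportions, cumulative = variance_explained(eigenvalues)
for k, (prop, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{k}: proportion={prop:.3f}, cumulative={cum:.3f}")
```

With a correlation matrix, each eigenvalue divided by the number of variables is the share of total variance carried by that component. The cumulative column is what you read off when deciding how many components to retain.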
QUESTION 2 (18 marks) You work as a data analyst for the company NZ Surveys LTD,
and your manager has a new task for you. The company Pharmaceutics-Auckland
has conducted an experiment to study the effect of three drugs (Drug 1, Drug
2 and Drug 3) on the recovery time of mice that had a certain disease. Forty mice were
randomly assigned to four groups, as follows: Group 1 (10 mice) was given three injections
of Drug 1, Group 2 (10 mice) was given three injections of Drug 2, Group 3 (10 mice)
was given three injections of Drug 3, and Group 4 (10 mice) was given a placebo. The recovery
times were recorded (in hours), as shown in the following data set:
Group 1 Group 2 Group 3 Group 4
53.2 51.2 71.6 89.7
49.9 55.1 75.6 83.2
51.2 48.0 69.4 91.5
48.5 53.1 67.1 79.4
47.3 57.5 76.5 87.5
48.0 52.0 71.3 85.5
52.8 53.9 77.8 80.7
47.4 54.6 65.3 77.9
51.6 54.8 74.1 75.6
51.9 53.7 73.7 81.5
‘Pharmaceutics-Auckland’ currently produces and sells pill packs of Drug 2 and
is planning to start producing and selling pills of Drug 1 and Drug 3. However, the
production cost of Drug 1 is around 5.7% higher than the production cost of Drug 2, whilst
producing Drug 3 would cost the company 17% more than producing Drug 2.
Any SAS output relevant to your answers in this question must be included in
an Appendix and properly referred to. Do not include output in your answers below.
a) (1 mark) Read this data into SAS. Give the code here as it will be marked.
b) (2 marks) Is this experimental design a ‘completely randomized design’? Justify your
answer.
c) (1 mark – plot + 2 marks reasoning) Based on a graphical assessment of the data,
would you expect a significant difference between the recovery times for Drug 1 and
Drug 2? Justify your answer. Here, you need to propose plots that are appropriate
for the type of data analyzed (include the plot in your answer).
d) (3 marks) What would be a suitable methodology for analyzing the data so that the
results give ‘Pharmaceutics-Auckland’ adequate insight with respect to its interests?
Why? Justify your answer. You must include (where applicable) the null
and alternative hypotheses associated with this method.
e) (1 mark) Generate SAS code to run this methodology/model in SAS. Give the code
here as it will be marked.
f) Run the selected model in SAS and, based on the results, write a short report for
the company ‘Pharmaceutics-Auckland’ of 2–3 paragraphs (3–4 sentences each),
including:
1. (3 marks) relevant information and conclusions from your analysis, and
2. (5 marks) recommendations for the company, in light of its economic interests, on
whether to start producing the new brand of drugs (highlighted in red above). For instance:
would you suggest making and selling Drug 1 and Drug 3?
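As a hand check on whatever SAS reports for this question, the group means and a one-way ANOVA F statistic can be computed directly from the table above. The Python sketch below assumes a one-way layout with four independent groups. That is one candidate analysis, not necessarily the methodology expected in part d), and it does not replace the SAS code requested in parts a) and e). The function name one_way_anova is introduced here for illustration only.

```python
# Recovery times (hours) copied from the table in Question 2.
groups = {
    "Drug 1": [53.2, 49.9, 51.2, 48.5, 47.3, 48.0, 52.8, 47.4, 51.6, 51.9],
    "Drug 2": [51.2, 55.1, 48.0, 53.1, 57.5, 52.0, 53.9, 54.6, 54.8, 53.7],
    "Drug 3": [71.6, 75.6, 69.4, 67.1, 76.5, 71.3, 77.8, 65.3, 74.1, 73.7],
    "Placebo": [89.7, 83.2, 91.5, 79.4, 87.5, 85.5, 80.7, 77.9, 75.6, 81.5],
}


def one_way_anova(samples):
    """Return (F statistic, df between, df within) for a one-way layout."""
    all_values = [x for s in samples for x in s]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples
    )
    # Within-group sum of squares: deviations from each group's own mean.
    ss_within = sum((x - sum(s) / len(s)) ** 2 for s in samples for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


means = {name: sum(v) / len(v) for name, v in groups.items()}
f_stat, df_b, df_w = one_way_anova(list(groups.values()))
for name, m in means.items():
    print(f"{name}: mean recovery time = {m:.2f} hours")
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The overall F test only says whether the group means differ somewhere. The cost comparison in the question depends on specific pairs (for example Drug 1 against Drug 2), which calls for follow-up pairwise comparisons.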
QUESTION 3 (12 marks) In this question we want to determine whether national figures
for birth rates, death rates, and infant death rates can be used to categorize countries.
Previous studies indicate that the clusters computed from this type of data can be elongated,
elliptical and irregular. Thus, you need to perform a linear transformation on the
raw data before the cluster analysis. The script clusterAssignment2.sas contains code
to conduct a cluster analysis with SAS on the data Poverty. These data have been compiled
from the United Nations Demographic Yearbook 1990 (United Nations publications, Sales
No. E/F.91.XII.1, copyright 1991, United Nations, New York) and are reproduced with
the permission of the United Nations.
a) (2 marks) What is the purpose of PROC ACECLUS in this analysis? Justify your answer.
b) (2 marks) What is the clustering method (linkage) used in this analysis? Why is this method used
here? HINT: Look at the PROC CLUSTER statement.
c) (2 marks) What is the purpose of PROC TREE in this analysis? Justify your answer.
d) (2 marks) PROC SGPLOT produces a scatter plot of the first two canonical variables, using the
value of the variable cluster as each point’s identifier. How many clusters are recommended
for this data set?
e) (4 marks) Write down a short paragraph (2–3 sentences) describing the dendrogram and the
scatter plot of the first two canonical variables.
QUESTION 4 (5 marks)
1. (2 marks) What’s the difference between an experiment and an observational study?
2. (3 marks) Provide the information requested below. An experiment was conducted to
explore the effect of vitamin B12 on the weight gain of pigs. Forty similar piglets
were randomly divided into 5 groups. Five levels of vitamin B12 (0, 5, 10, 15, and 20
mg per pound of corn meal ration) were randomly assigned to the 5 groups. At the end
of the treatment period, each piglet’s weight gain in pounds was recorded.
Response variable:
Factor:
Levels:
– END OF ASSIGNMENT 2 –