44 27 24 24 36 36 44 44 120 29 36 36 36
Create a variable named deathTime with this data using the scan function in the Commands Window. [How?]
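As a rough sketch of this step, the lines below read the thirteen values into deathTime. They are written in R syntax, which shares the S language with S-PLUS. The text= argument is an R feature and is an assumption here; it may not be available in S-PLUS. In the Commands Window you would instead type deathTime <- scan(), enter the numbers at the prompt, and finish with a blank line.

```r
# Read the thirteen observations from a string.
# Assumption: scan(text = ...) is R syntax and may not exist in S-PLUS;
# there, call scan() with no arguments and type the values at the prompt.
deathTime <- scan(text = "44 27 24 24 36 36 44 44 120 29 36 36 36")

# Check that all thirteen values were read.
length(deathTime)
```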
Use the quantile function to find the five-number summary, including the lower and upper quartiles.
> quantile(deathTime)
 0% 25% 50% 75% 100%
 24  29  36  44  120
Boxplots in S-PLUS also identify potential outliers. The interquartile range (IQR) is the distance between the upper and lower quartiles, seen graphically as the height of the box. Any individual observations that are more than 1.5 IQR units from the box are identified separately with a horizontal line. The whiskers extend to the maximal and minimal observations that are not potential outliers.
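The 1.5 IQR rule can be checked by hand on the deathTime data. The sketch below is written in R syntax and assumes the box edges coincide with the quartiles reported by quantile, which is the case for these thirteen values; the names lowerFence and upperFence are introduced here for illustration only.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# Lower and upper quartiles (the edges of the box).
q <- as.numeric(quantile(deathTime, c(0.25, 0.75)))
iqr <- q[2] - q[1]                 # 44 - 29 = 15

# Observations beyond these fences are potential outliers.
lowerFence <- q[1] - 1.5 * iqr     # 29 - 22.5 = 6.5
upperFence <- q[2] + 1.5 * iqr     # 44 + 22.5 = 66.5

inside <- deathTime >= lowerFence & deathTime <= upperFence
outliers <- deathTime[!inside]     # 120
whiskers <- range(deathTime[inside])  # 24 and 44
```

Here the upper whisker stops at 44, the edge of the box, because 120 is the only observation above the upper quartile and it lies beyond the fence.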
Skewness is easily seen in a boxplot. If one whisker and box half are longer than the other, the distribution is skewed. If the lower values are more spread out than the upper values, we say the data is skewed to the left. If the upper values are more spread out than the lower values, we say the data is skewed to the right. A distribution that is symmetric will not be skewed.
Find the HARVEST data set on the course Web page and save it to the Desktop. Import the data into S-PLUS. [How?]
Attach the data frame so that we may refer to variables by name. [How?]
Display the variable HRCB in a boxplot by following these directions.
You should check that the boxplot agrees with the calculated quartiles and median.
> quantile(HRCB,na.rm=T)
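The na.rm=T argument tells quantile to drop missing values (NA) before computing the summary. The sketch below, in R syntax, illustrates the effect using deathTime with one missing value appended as a stand-in, since the HRCB values are not reproduced here; whether HRCB actually contains missing values is an assumption.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# Stand-in for a variable with a missing observation.
withNA <- c(deathTime, NA)

# With na.rm = TRUE (T is the abbreviation) the NA is removed first,
# so the summary matches that of the thirteen observed values.
q <- quantile(withNA, na.rm = TRUE)
q
```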
Is this distribution skewed or fairly symmetric?
Does the graph show markedly different blood pressures depending on smoking status? Does dosage (number of cigarettes per day) seem to matter much?
These approximations will not be very good for strongly skewed data.
The standard deviation of a single variable may be found in the Commands Window. There is no built-in function for the standard deviation, but var finds the variance and sqrt finds the square root.
> sort(deathTime)
 [1]  24  24  27  29  36  36  36  36  36  44  44  44 120
> mean(deathTime)
[1] 41.23077
> sqrt(var(deathTime))
[1] 24.68182
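To see what sqrt(var(deathTime)) computes, the same number can be built up step by step. This is a minimal sketch in R syntax, assuming var uses the usual sample variance with divisor n - 1; the names deviations, variance and stdDev are chosen here for illustration.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

n <- length(deathTime)

# Deviation of each observation from the mean.
deviations <- deathTime - mean(deathTime)

# Sample variance: sum of squared deviations divided by n - 1.
variance <- sum(deviations^2) / (n - 1)

# The standard deviation is the square root of the variance.
stdDev <- sqrt(variance)
stdDev
```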
Sorting the data and calculating the mean are not necessary. Note that the single observation of 120 is an outlier and skews the distribution to the right. Only four observations are larger than the mean while nine are smaller, and the mean is larger than the median. In this case, the standard deviation is not a "typical" deviation. The outlier 120 is much farther away from the mean, while all other observations are substantially closer.
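These counts and distances can be confirmed directly. The lines below are a short sketch in R syntax; the variable names are illustrative.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

m <- mean(deathTime)
med <- median(deathTime)

# How many observations fall on each side of the mean.
above <- sum(deathTime > m)   # 4
below <- sum(deathTime < m)   # 9

# Absolute distance of each observation from the mean, largest first.
distances <- rev(sort(abs(deathTime - m)))
distances
```

The largest distance belongs to 120 and exceeds the standard deviation several times over, while every other observation lies within one standard deviation of the mean.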
You can plot a boxplot of this data from the Commands Window as well.
> boxplot(deathTime)
See the skewness in the plot and notice the outlier at 120.
potass (mg potassium per serving).
Describe how this plot indicates the presence of potential outliers.
Identify the brands of cereal which are outliers. To do this, it may be helpful to display the potassium values and cereal names, sorted from the largest potassium value to the smallest, in the Commands Window.
> attach(cereal)
> ord <- rev(order(potass))
> data.frame(name=name[ord],potass=potass[ord])
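The rev(order(...)) idiom is what produces the largest-to-smallest ordering: order returns the positions that would sort the values in increasing order, and rev reverses them. Since the cereal data are not reproduced here, the sketch below (R syntax) applies the same idiom to deathTime as a stand-in.

```r
deathTime <- c(44, 27, 24, 24, 36, 36, 44, 44, 120, 29, 36, 36, 36)

# Positions of the observations from largest value to smallest.
ord <- rev(order(deathTime))

# Indexing by ord lists the values in decreasing order; indexing a second
# variable (such as name) by the same ord would keep the rows matched up.
sortedDown <- deathTime[ord]
sortedDown
```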
What characteristic of the cereal, apparent from the name, is common to the brands that are outliers?
sugar versus shelf.
Sketch this on your answer sheet.
Bret Larget, larget@mathcs.duq.edu