Problem

This problem continues the analysis of the chromatin data from Problem 45 of Chapter 8 and is concerned with further examining goodness of fit.

a. Goodness of fit can also be examined via probability plots in which the quantiles of a theoretical distribution are plotted against those of the empirical distribution. Following the discussion in Section 9.8, show that it is sufficient to plot the observed order statistics, X(k), versus the quantiles of the Rayleigh distribution with θ = 1. Construct three such probability plots and comment on any systematic lack of fit that you observe. To get an idea of what sort of variability could be expected due to chance, simulate several sets of data from a Rayleigh distribution and make corresponding probability plots.

b. Formally test goodness of fit by performing a chi-squared goodness of fit test, comparing histogram counts to those predicted from the Rayleigh model. You may need to combine cells of the histograms so that the expected counts in each cell are at least 5.
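A minimal sketch of both parts follows, under stated assumptions. For part a, the quantile function of a Rayleigh distribution with parameter θ is θ times the quantile function for θ = 1, so plotting the order statistics against the θ = 1 quantiles should give roughly a straight line through the origin with slope θ if the model fits. For part b, the sketch uses cells of equal probability under the fitted model instead of merging histogram cells; this is a substitution that guarantees every expected count is n/m, not the procedure the problem prescribes. The function names, the plotting positions k/(n+1), the choice of 10 cells, and the simulated sample standing in for the Chromatin files are all illustrative assumptions.

```python
import bisect
import math
import random


def rayleigh_quantile(p, theta=1.0):
    """Inverse of F(r) = 1 - exp(-r^2 / (2 theta^2)), valid for 0 <= p < 1."""
    return theta * math.sqrt(-2.0 * math.log(1.0 - p))


def probability_plot_points(data):
    """Pair each order statistic X_(k) with the theta = 1 quantile at k/(n+1)."""
    xs = sorted(data)
    n = len(xs)
    return [(rayleigh_quantile(k / (n + 1)), x) for k, x in enumerate(xs, start=1)]


def chi2_statistic(data, n_cells=10):
    """Pearson statistic using cells of equal probability under the fitted model."""
    n = len(data)
    theta_hat = math.sqrt(sum(x * x for x in data) / (2 * n))  # Rayleigh mle
    # Interior cell boundaries: fitted quantiles at 1/m, 2/m, ..., (m-1)/m.
    edges = [rayleigh_quantile(j / n_cells, theta_hat) for j in range(1, n_cells)]
    observed = [0] * n_cells
    for x in data:
        observed[bisect.bisect_right(edges, x)] += 1
    expected = n / n_cells  # identical in every cell by construction
    stat = sum((o - expected) ** 2 / expected for o in observed)
    df = n_cells - 2  # cells, minus 1, minus one estimated parameter
    return stat, df, observed


# Simulated stand-in for one experiment (the Chromatin files are not read here).
rng = random.Random(0)
sample = [rayleigh_quantile(rng.random(), theta=2.0) for _ in range(150)]

points = probability_plot_points(sample)  # (theoretical, observed) pairs to plot
stat, df, observed = chi2_statistic(sample)
print(f"chi-squared = {stat:.2f} on {df} degrees of freedom")
```

The statistic can then be referred to a chi-squared table with the stated degrees of freedom. Because θ is estimated here from the ungrouped data, that reference distribution is only approximate.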

Reference

A Random Walk Model for Chromatin. A human chromosome is a very large molecule, about 2 or 3 centimeters long, containing 100 million base pairs (Mbp). The cell nucleus, where the chromosome is contained, is in contrast only about a thousandth of a centimeter in diameter. The chromosome is packed in a series of coils, called chromatin, in association with special proteins (histones), forming a string of microscopic beads. It is a mixture of DNA and protein. In the G0/G1 phase of the cell cycle, between mitosis and the onset of DNA replication, the mitotic chromosomes diffuse into the interphase nucleus. At this stage, a number of important processes related to chromosome function take place. For example, DNA is made accessible for transcription and is duplicated, and repairs are made of DNA strand breaks. By the time of the next mitosis, the chromosomes have been duplicated. The complexity of these and other processes raises many questions about the large-scale spatial organization of chromosomes and how this organization relates to cell function. Fundamentally, it is puzzling how these processes can unfold in such a spatially restricted environment. At a scale of about $10^{-3}$ Mbp, the DNA forms a chromatin fiber about 30 nm in diameter; at a scale of about $10^{-1}$ Mbp the chromatin may form loops. Very little is known about the spatial organization beyond this scale. Various models have been proposed, ranging from highly random to highly organized, including irregularly folded fibers, giant loops, radial loop structures, systematic organization to make the chromatin readily accessible to transcription and replication machinery, and stochastic configurations based on random walk models for polymers.

A series of experiments (Sachs et al., 1995; Yokota et al., 1995) was conducted to learn more about spatial organization on larger scales. Pairs of small DNA sequences (size about 40 kbp) at specified locations on human chromosome 4 were fluorescently labeled in a large number of cells. The distances between the members of these pairs were then determined by fluorescence microscopy. (The distances measured were actually two-dimensional distances between the projections of the paired locations onto a plane.) The empirical distribution of these distances provides information about the nature of large-scale organization.

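There has long been a tradition in chemistry of modeling the configurations of polymers by the theory of random walks. As a consequence of such a model, the two-dimensional distance should follow a Rayleigh distribution with density

$$f(r \mid \theta) = \frac{r}{\theta^2} \exp\left(-\frac{r^2}{2\theta^2}\right), \quad r \ge 0.$$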

Basically, the reason for this is as follows: The random walk model implies that the joint distribution of the locations of the pair in $\mathbb{R}^3$ is multivariate Gaussian; by properties of the multivariate Gaussian, it can be shown that the joint distribution of the locations of the projections onto a plane is bivariate Gaussian. As in Example A of Section 3.6.2 of the text, it can be shown that the distance between the points follows a Rayleigh distribution. In this exercise, you will fit the Rayleigh distribution to some of the experimental results and examine the goodness of fit. The entire data set comprises 36 experiments in which the separation between the pairs of fluorescently tagged locations ranged from 10 Mbp to 192 Mbp. In each such experimental condition, about 100–200 measurements of two-dimensional distances were determined. This exercise will be concerned just with the data from three experiments (short, medium, and long separation). The measurements from these experiments are contained in the files Chromatin/short, Chromatin/medium, Chromatin/long.
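The step from a bivariate Gaussian to a Rayleigh distance can be sketched as follows. The sketch assumes that the two coordinates of the projected displacement, written here as $U$ and $V$ (symbols introduced only for illustration), are independent $N(0, \theta^2)$ random variables, and that $R = \sqrt{U^2 + V^2}$ is the observed distance. Changing to polar coordinates gives

$$P(R \le r) = \iint_{u^2 + v^2 \le r^2} \frac{1}{2\pi\theta^2} e^{-(u^2 + v^2)/(2\theta^2)} \, du \, dv = \int_0^r \frac{s}{\theta^2} e^{-s^2/(2\theta^2)} \, ds = 1 - e^{-r^2/(2\theta^2)},$$

and differentiating with respect to $r$ yields

$$f(r \mid \theta) = \frac{r}{\theta^2} e^{-r^2/(2\theta^2)}, \quad r \ge 0.$$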

a. What is the maximum likelihood estimate of θ for a sample from a Rayleigh distribution?

b. What is the method of moments estimate?

c. What are the approximate variances of the mle and the method of moments estimate?

d. For each of the three experiments, plot the likelihood functions and find the mle’s and their approximate variances.

e. Find the method of moments estimates and the approximate variances.

f. For each experiment, make a histogram (with unit area) of the measurements and plot the fitted densities on top. Do the fits look reasonable? Is there any appreciable difference between the maximum likelihood fits and the method of moments fits?

g. Does there appear to be any relationship between your estimates and the genomic separation of the points?

h. For one of the experiments, compare the asymptotic variances to the results obtained from a parametric bootstrap. In order to do this, you will have to generate random variables from a Rayleigh distribution with parameter θ.

Show that if X follows a Rayleigh distribution with θ = 1, then Y = θX follows a Rayleigh distribution with parameter θ. Thus it is sufficient to figure out how to generate random variables that are Rayleigh with θ = 1. Show how Proposition D of Section 2.3 of the text can be applied to accomplish this. B = 100 bootstrap samples should suffice for this problem. Make a histogram of the bootstrap values of the estimate of θ. Does the distribution appear roughly normal? Do you think that the large sample theory can be reasonably applied here? Compare the standard deviation calculated from the bootstrap to the standard errors you found previously.

i. For one of the experiments, use the bootstrap to construct an approximate 95% confidence interval for θ using B = 1000 bootstrap samples. Compare this interval to that obtained using large sample theory.
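The computational parts of this list can be sketched as follows, under stated assumptions. Setting the derivative of the Rayleigh log likelihood to zero gives the mle $\hat\theta = \sqrt{\sum x_i^2 / (2n)}$, and matching the mean $E(X) = \theta\sqrt{\pi/2}$ gives the method of moments estimate $\bar{x}\sqrt{2/\pi}$. The approximate variances used below are $\theta^2/(4n)$ for the mle and $(4 - \pi)\theta^2/(\pi n)$ for the method of moments estimate. Random variables are generated by inverting the cdf $F(r) = 1 - e^{-r^2/(2\theta^2)}$. The function names, the simulated data standing in for a Chromatin file, and the reflected form of the bootstrap interval are illustrative assumptions, not the only reasonable choices.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def rayleigh_sample(n, theta, rng):
    """Inverse-cdf method: theta * sqrt(-2 log(1 - U)) with U uniform on [0, 1)."""
    return [theta * math.sqrt(-2.0 * math.log(1.0 - rng.random())) for _ in range(n)]


def mle(data):
    """Maximum likelihood estimate of theta."""
    return math.sqrt(sum(x * x for x in data) / (2 * len(data)))


def method_of_moments(data):
    """Estimate of theta from E(X) = theta * sqrt(pi / 2)."""
    return mean(data) * math.sqrt(2.0 / math.pi)


rng = random.Random(1)
data = rayleigh_sample(150, 2.0, rng)  # simulated stand-in for one experiment
n = len(data)

theta_hat = mle(data)
theta_mom = method_of_moments(data)

# Large-sample standard errors, with theta replaced by its estimate.
se_mle = theta_hat / (2.0 * math.sqrt(n))
se_mom = theta_mom * math.sqrt((4.0 - math.pi) / (math.pi * n))

# Parametric bootstrap: resample from the fitted model and re-estimate each time.
B = 1000
boot = sorted(mle(rayleigh_sample(n, theta_hat, rng)) for _ in range(B))
se_boot = stdev(boot)

# Approximate 2.5% and 97.5% points of the bootstrap estimates.
lower_q = boot[B * 25 // 1000]
upper_q = boot[B * 975 // 1000 - 1]

# Reflected interval: uses the bootstrap distribution of (theta* - theta_hat).
boot_ci = (2.0 * theta_hat - upper_q, 2.0 * theta_hat - lower_q)

# Interval from large-sample normal theory, for comparison.
z = NormalDist().inv_cdf(0.975)
normal_ci = (theta_hat - z * se_mle, theta_hat + z * se_mle)

print(f"mle = {theta_hat:.3f}, method of moments = {theta_mom:.3f}")
print(f"se (asymptotic) = {se_mle:.4f}, se (bootstrap) = {se_boot:.4f}")
print(f"bootstrap 95% CI = ({boot_ci[0]:.3f}, {boot_ci[1]:.3f})")
print(f"normal-theory 95% CI = ({normal_ci[0]:.3f}, {normal_ci[1]:.3f})")
```

The sorted list `boot` holds the bootstrap estimates, so it is also what would be passed to a histogram routine to judge normality in part h.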
