Muna Alkhabbaz

If I have a large sample size, is it acceptable to assume that my data are normally distributed?

My sample size is around 400. I was told that, based on the central limit theorem, I can assume normality of my data. Is that correct?


Most recent answer

University of Strathclyde

The CLT describes a normal distribution of sample means.

So, you do not assume your data are normal. You assume the means are normal.

Popular Answers (1)

I think that the question that you really want to ask is "If my sample is large, can I use parametric statistics with a non-normal distribution of the data (or, more precisely, a non-normal distribution of the residuals)?"

In brief, my understanding is that the answer to the revised question is yes. But I have not come across a definitive paper that gives guidelines on sample sizes for different types of parametric procedures and for different levels of deviation from normality (e.g. values of skewness greater than one). However, I do recall coming across a couple in the distant past that looked only at the t-test, and the required numbers were much lower than your N of 400.

That said, I am not sure if this allows you to avoid the use of more advanced procedures, such as generalized linear (mixed) models, that are specifically designed to model non-normal data. These are commonly used in the analysis of large data sets by statisticians with much higher levels of expertise than myself, so I assume that there must be good reasons. (E.g. in highly skewed data there is commonly a tendency for greater variance in scores among those cases with higher scores, leading to biased estimates - the so-called mean-variance association problem.) Also, in certain contexts, with highly skewed data, one might question the meaningfulness of modelling mean values (rather than, say, medians or percentiles), which would be the results of parametric analyses.

All Answers (22)

No. Check out the central limit theorem in any mathematical statistics book. To begin with, it is about a sample mean. Are you dealing with a sample mean? Next, it says that under certain conditions the distribution of a standardized sample mean approaches a standard normal distribution as the sample size tends to infinity. It does not say what the rate of convergence is. The theorem also requires a simple random sample, and you have not mentioned anything about how you took the sample. I would advise you to begin with some sort of probability plot to see whether the sample looks at least approximately normal. If it does, there are statistical tests of normality; with a sample size of 400, I would recommend a Kolmogorov-Smirnov test, assuming that your sample is a simple random sample. All of these things and more go into the use of the central limit theorem. It does not say that if you have 400 of something, those 400 things follow a normal distribution. Please be careful and apply theorems advisedly or you will get wrong answers.

Thanks very much, David, for your answer. I heard this piece of information from a friend, but I wasn't convinced about it. That's why I thought it better to ask experts in the field of statistics.

Going back to your recommendation of the Kolmogorov-Smirnov test: that is a very sensitive test, and even if data look normally distributed using visual methods, the Kolmogorov-Smirnov test might show that the data are not normally distributed. Are there any other tests that I can use to check for normality?

Independent researcher (formerly with Geoscience Australia, now retired)

Hi Muna,

A very simple visual test to check whether a dataset follows a given distribution is a QQ plot. In this technique you compare the quantiles of your data against the quantiles of a standard distribution (a normal in your case). If they lie on a straight line, then the answer is yes, the given distribution follows the standard one. A QQ plot can easily be produced in R using the function 'qqplot()'. An example of the QQ implementation is here: http://data.library.virginia.edu/understanding-q-q-plots/
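
For anyone working in Python rather than R, the same check can be sketched with SciPy's `probplot`, which also reports how closely the quantile points follow the reference line (the data here are simulated for illustration, not the thread's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=400)  # stand-in for your sample

# Compare sample quantiles against theoretical normal quantiles.
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist="norm")

# r close to 1 means the points lie near a straight line,
# i.e. the sample is consistent with a normal distribution.
print(round(r, 3))
```

An `r` near 1 indicates the quantile points hug the straight line; markedly smaller values suggest departure from normality.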

Hope it helps,

Augusto

University of West Florida

What is the planned statistical analysis that you would like to do, Muna? Is there a nonparametric version available to you?

Thank you, Augusto, for your answer. Yes, that example is helpful.

Hello Raid, I applied an unpaired t-test, for which the nonparametric version is the Mann-Whitney U test. Actually I applied both and got similar results. But I need to plot some graphs using mean values (not possible with the median, as most groups had the same median). That's why I want to make sure that my data are normally distributed.

What about the values of skewness and kurtosis to assess normality? Will they help?

Regards,

Muna

I still recommend the K-S test. It has been a standard for many years. A normal probability plot and a K-S test or Shapiro-Wilk test should pin it down. Best, David
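
As a sketch of how these two tests behave on visibly skewed data (Python/SciPy here purely for illustration; the same tests exist in most statistics packages):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.exponential(size=400)   # deliberately skewed "data"

# Shapiro-Wilk test of normality.
w, p_sw = stats.shapiro(data)

# Kolmogorov-Smirnov test against a normal with the sample's own mean
# and sd (estimating the parameters from the data makes the nominal
# K-S p-value only approximate; a Lilliefors correction is stricter).
d, p_ks = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))

# Both tests flag this skewed sample as non-normal at the 5% level.
print(p_sw < 0.05, p_ks < 0.05)
```

With n = 400, both tests have plenty of power, which is also why, as noted above, they can reject normality for data that look acceptable on a plot.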

University of West Florida

Hello Muna,

I would use normal quantiles (z scores) in place of the raw data. Have you ever used such a ranks method? It is explained in William Conover's text Nonparametric Statistics. It will limit the effects of any existing outliers, and it will also make the interpretation of the results easier for you.
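
A minimal sketch of this normal-scores (van der Waerden) idea, in illustrative Python rather than any particular package's built-in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(size=400)      # skewed raw scores

# Replace each value by the normal quantile of its rank:
# z_i = Phi^{-1}( rank_i / (n + 1) )   (van der Waerden scores)
n = len(x)
z = stats.norm.ppf(stats.rankdata(x) / (n + 1))

# The transformed scores are symmetric around zero whatever the shape
# of the original data, which blunts the influence of outliers.
print(round(z.sum(), 6))
```

Because only ranks enter the transformation, an extreme outlier moves to the largest normal score and no further.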

If most groups had the same median value, did you actually start out with a continuous variable, or did you measure the outcomes with a Likert scale? If you have discrete values, you can show descriptive statistics on the means and the standard deviations.

Thanks, David, for the recommendation.

Raid, I never used such a ranks method before; I will read about it and try to apply it if appropriate.

Yes, I did measure outcomes with a Likert scale. I know some people consider it continuous while others consider it discrete. So is it acceptable to present descriptive statistics on the means and SD (mainly to draw graphs) when applying non-parametric tests?

University of West Florida

If you used a Likert scale, each value is discrete, while averages are viewed by many as being continuous. You could also display a bar chart of the individual Likert scores.

Cheers Raid

I'm plotting a simple scattergram to correlate data from two surveys (staff and patient perceptions of quality). The relationship was clear when I used means, but when I thought of using a non-parametric test (for the patient data only, as I had doubts about its normality), I replaced the patients' means with medians, and the graph made no sense (as most had the same median).

In this case, can I still do a non-parametric test but use means to draw the scattergram?

University of West Florida

Yes, you can. This is frequently done. The nonparametric test keeps the inferential part intact, while using the actual scores shows what really is there. I have recently worked with such data on hospital patients' and nurses' perceptions of quality of service.
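
That workflow can be sketched as follows, with hypothetical scores (the group sizes and values are made up, not the thread's data): nonparametric inference, descriptive means for plotting.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Hypothetical mean Likert scores for two survey groups.
staff = rng.normal(3.8, 0.6, 200)
patients = rng.normal(3.5, 0.6, 200)

# Nonparametric test for the inference...
u, p_u = stats.mannwhitneyu(staff, patients)

# ...while the graphs and descriptives still use the means.
print(round(staff.mean(), 2), round(patients.mean(), 2), p_u < 0.05)
```

The p-value comes from the rank-based test, so the inference does not lean on normality, while the plotted means describe what is actually in the data.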

Al-Mustansiriya University

Hello all,

No, you can't assume the normality condition for the items themselves, but you can compute the total score for each individual (observation); the total-score variable (the sum of the n item scores x_i) may then be normally distributed. It is necessary to check the normality of that variable, for example using the Statgraphics package, by drawing the histogram and checking the skewness and kurtosis, or by using the Kolmogorov-Smirnov test and the chi-square test.
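
This suggestion can be sketched with hypothetical Likert responses (the item count and sample size below are illustrative, not from the thread):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical 5-point Likert responses: 400 respondents x 10 items.
items = rng.integers(1, 6, size=(400, 10))
total = items.sum(axis=1)          # total score per respondent

# Individual items are discrete and bounded, but the sum of many items
# is often close to normal; check skewness and excess kurtosis.
print(round(stats.skew(total), 2), round(stats.kurtosis(total), 2))
```

Values of skewness and excess kurtosis near zero are consistent with approximate normality of the total score, though a histogram or QQ plot should accompany the numbers.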

Regards,

Huda

University of West Florida

If I suspect non-normality, I switch to a nonparametric method and move on with the statistical analysis. In the past 35 years of doing statistical consulting I have not looked into kurtosis or skewness beyond inspecting a plot of the data. Keeping things simple works very well.

Thanks very much, Huda.

I agree, Raid. I think I will go with the non-parametric methods but use the means to draw the scattergrams.

Hope that will work.

Many thanks Raid

Don't forget that nonparametric procedures have assumptions to be tested and still require a probability sample. Best, David

Thank you, David. I used systematic sampling, which is a type of probability sampling, and I will consider all assumptions for the non-parametric tests.

All-time regards, Muna

Italian National Institute of Statistics

The CLT says that the mean of your distribution is normally distributed as n increases, not the data themselves.


Universidad Miguel Hernández de Elche

You may want to have a look at this paper (10.1177/1073191116669784).

For the calculation of sample sizes for normative data, the authors analyzed the minimum sample size at which both mean and standard deviation estimates remained within the 90% confidence intervals surrounding the population estimates, for different levels of skewness. They reported that "Sample sizes of greater than 85 were found to generate stable means and standard deviations regardless of the level of skewness, with smaller samples required in skewed distributions".

Hope this helps.

Kind regards,

Javi

Italian National Institute of Statistics

In general you can't. Given a sample drawn from X1, X2, ..., Xn i.i.d. r.v.s, the CLT requires that the mean and variance be finite for it to work. So, for instance, if your Xj's are Cauchy-distributed (the mean and variance do not exist), the CLT (as well as the LLN, for that matter) is not applicable. In other words, the sample mean of n Cauchys is still Cauchy.
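
This failure is easy to see by simulation (a sketch; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# 2000 sample means, each computed from n = 400 standard Cauchy draws.
means = rng.standard_cauchy((2000, 400)).mean(axis=1)

# If the CLT applied, means of 400 draws would be tightly concentrated
# near zero; for the Cauchy, the sample means are themselves standard
# Cauchy, so extreme values keep occurring no matter how large n is.
print(np.abs(means).max() > 10)
```

Repeating this with, say, a uniform or exponential parent distribution instead produces sample means that cluster tightly, which is exactly the CLT behavior the Cauchy lacks.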

Shahid Beheshti University of Medical Sciences

  • The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger.
  • Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.
  • A key aspect of the CLT is that the average of the sample means and standard deviations will equal the population mean and standard deviation.
  • A sufficiently large sample size can predict the characteristics of a population accurately.

All of the above sentences are about the average of the sample means. If you take just one sample, then the sampling distribution of the mean is approximately normal when the sample size is larger than 30.

So the CLT is not about samples; it is about the means of samples. (All it is saying is that as you take more samples, especially large ones, the graph of the sample means will look more and more like a normal distribution.) If we have just one sample, which in practice is usually all one takes from the population, the CLT speaks about the average of that sample.

I illustrated the above in R:

#
# Create 100000 random values from the standard uniform distribution
Uniform <- runif(100000)
#
# The histogram of the uniform values
hist(Uniform)
#
# The distribution of any single sample of the uniform values is not normal
hist(sample(Uniform, 100))
hist(sample(Uniform, 1000))
hist(sample(Uniform, 10000))
#
# The distribution of sample means of the uniform values is approximately normal
hist(sapply(1:100, function(x) mean(sample(Uniform, 100))))
hist(sapply(1:1000, function(x) mean(sample(Uniform, 100))))
hist(sapply(1:10000, function(x) mean(sample(Uniform, 100))))
