ftptest.leeds.ac.uk

home *** CD-ROM | disk | FTP | other *** search

/ ftptest.leeds.ac.uk / 2015.02.ftptest.leeds.ac.uk.tar / ftptest.leeds.ac.uk / bionet / maths / anova.asc < prev next >

Wrap

Text File | 1993-12-14 | 23KB | 300 lines

APNU1030 Data Handling Exercise Introduction In this exercise you will work on a quality control/quality assurance exercise using statistical analysis. Read all of this handout carefully before you start work. The exercise represents a full day of work for a professional person. Allowing for the fact that you may be less familiar with the work this exercise may take up to ten hours. You may be tempted to avoid the computer and try to do the data processing on paper. Please don't, for your own sake, as you will never get it finished and it will be impossible to avoid data entry mistakes on your calculator. If you make one slip with a decimal point the result will be out and you will have to work though all the calculations again. Check your progress with your tutor/demonstrator as you proceed if you wish. The Scenario You are supervising a food analysis laboratory. Your client is a margarine manufacturer and their brand of margarine has nutritional information printed on the package. Your job is to take samples from several batches of margarine, measure the amount of unsaturated fat and indicate whether this is consistent with the claims on the packaging. The packaging states that the fat component of the product is 53% saturated and 47% monounsaturated. You will include some data validation in order to convince your client of the reliability of your results. You have four technicians who are trained to make measurements and you will use forty replicate measurements on each of the five batches of margarine. (The client is very determined to get good results.) There are several questions you must answer. a) Are any of your technicians introducing consistently wrong results? b) Are the batches significantly different from each other? c) If the answer to b is 'no', is the 'global' mean from all batches taken together consistent with the manufacturers stated value allowing for your level of measurement error? In order to answer a) and b) you must arrange for each technician to make some measurements from each of the batches. If you assigned a batch to one technician you wouldn't be able to see if the different means were due to the margarine being genuinely different in different batches or if certain technicians were producing biased results. So each technician will make ten replicate measurements for each of the five batches of margarine. This means that each technician will make fifty measurements and each batch will be tested forty times. Because you have a lot of data the process would be very time consuming if done on paper with a pocket calculator. You will use a spreadsheet to help you come up with fast and reliable results. Your technicians will provide the data already entered into the spreadsheet and you will have to add the formulae for calculating the results. Obtaining the Data The technicians will send you the data in an Excel spreadsheet file attached to an electronic mail message. (Put in a request for data by mailing your APNU1030 tutor.) To get the file onto your disc wait for a message from the technicians. The message will indicate that there is an attachment to the message. Click on the attachments button of the message window. You will be shown a list of files attached to the message. There will only be one file in the list. Click on the filename to select it and use the save button. Another box pops up which allows you to choose which disc to put the file on. Unfortunately this box is a bit different to what you are used to. Directories and disc drives are listed together in a box marked 'Directories'. Scroll until you find your disc drive indicated as, for example, [-a-] and double click on it. Now you are on the correct disc drive you can click on the OK button to save the file there. Calculations The margarine you are testing contains 98% lipid. This is manufactured using an oil with three identical monounsaturated acyl groups, which has been partially hydrogenated in order to make it solid at room temperature. The percentage of saturated fat is taken to be 100% if all the available double bonds are hydrogenated and 0% if none of the double bonds have been hydrogenated. You may take the molecular weight of the saturated and unsaturated lipids to be both approximately 880 gmol-1 (The addition of two hydrogen atoms makes little difference to the molecular weight.) The values supplied by your technicians give the number of double bonds in mmol g-1. The packaging states the percentage of fat which is monounsaturated. You can analyse the raw data to answer the first two questions but to compare your mean with the stated value on the packaging you will either convert your calculated mean to a percentage or the stated percentage to millimoles of double bonds per gramme. You will have to arrange data from the technicians' spreadsheet in different ways for the different calculations so leave that spreadsheet intact and simply copy blocks of data into a separate sheet for each calculation. The Global Mean It is quite quick to test your 'grand mean' from all the data against the stated value on the packaging so you can start with this. You can go on to look at the data more closely at a later stage to test the quality assurance side of the project. If you are lucky your team will be up to scratch and this test of the mean will stand up. If not, you may have to repeat it! Testing the Mean Copy all the data onto a separate new spreadsheet as a block. It doesn't matter what order it goes in as long each datum is pasted in only once. Give the block of data the name 'data'. (See your course manual.) Use some convenient cells on the spreadsheet to enter formulae that will give you the number of datum points, the mean and standard deviation. =COUNT(data) =AVERAGE(data) =STDEV(data) Name the cells with these results in 'N', 'mean', and 's' respectively. You will either have to convert these figures to percentage saturation using more cells on the spreadsheet or use some cells to convert the manufacturer's stated value into mmol g-1. This will be necessary in order to make the comparison. You now have a mean calculated from a large number of measurements which you can compare with the stated value on the margarine's packaging. How do you do it? First you need to estimate your confidence in the measured mean. This is the standard error of the mean. This standardised measure of the experimental error depends on the spread of the measurements but also the number of measurements used to estimate the mean. Divide your standard deviation by the square root of the number of measurements used to estimate the mean. This is your standard error. You can see that as you make more measurements the standard error decreases and you are therefore more confident about your mean value. Standard error is a measure of the range of likely values for the actual saturated fat content of the margarine. There is an 84% chance (worked out from the normal distribution) that the actual value lies within the range of the estimated mean plus or minus the standard error. If you compare your mean value with the expected you can test the hypothesis that the measured value is consistent with the expected value. In other words, what is the probability that your measured mean could have resulted from measurements of saturated fat content, with normal experimental error, of margarine of the stated saturated fat content. If the probability is very low then you will have satisfactorily proved that the level in the supplied samples is not as stated on the packaging. Calculate the deviation of your measured mean from the expected (hypothetical) mean (a simple subtraction). You must scale this in terms of the standard error. Simply divide by the standard error. You now have a measure of how far the expected value is from the measured mean scaled according to your confidence in the measurement. A deviation between two means (measured or hypothetical) scaled in this way is usually called 't'. Now you need to know the probability. When you look at the deviation of a measurement from a mean the deviation is scaled to standard deviation. When you look at the deviation between two means the deviation is scaled to standard error. In either case it is the normal distribution itself which relates the scaled deviation to a probability of reaching such a deviation by chance alone. However, because the normal distribution is described partly by the standard deviation and because this is only estimated by your data, if you have small numbers of datum points the distribution is adjusted. That is why the t distribution is used. The t distribution is an adjusted version of the normal distribution for different numbers of data points. The deviation you calculate is referred to as t. The calculation of p from d is complex to perform on paper and the calculation of p from t is more complex because you must involve the number of measurements as well, but the spreadsheet program you are using has a simple function for calculating either. (If you were doing this on paper you would probably accept that, with the large amount of data you have, you have such a good estimate of standard deviation that a normal distribution would be acceptable - the deviation is called d instead of t.) On a spreadsheet enter a formula like this; =TDIST( t, count-1, TRUE ) The TRUE in the function makes it perform a two tailed test. This is needed because you are concerned about the measurement being too low as well as too high. If the probability is less than 0.05 you can confidently reject the value stated on the packaging. Comparing Means Assume initially that the various batches of margarine are all the same at first. It is safe to look at the four mean values each of your technicians produced (using all the batches) and compare them. Are they consistent with each other? You may have already met t tests as a way of comparing two means. Analysis of variance is an alternative way of comparing means. It allows you to compare two means like a t test but can be extended to compare any number of means. Analysis of variance is also a little easier to understand than the various t tests. Variation Variation between individual values in our situation could be due to unavoidable measurement error or due to consistent error introduced by the technician (we are ignoring any possible variation due to batches being different for the moment.) So if you take all 240 measurements and calculate the variance these two forms of variation will both be contributing to it. If you could separate the two contributions and compare them it will show you if consistent error, due to the technician, is real and significant That is what analysis of variance is about. You calculate the two variance values which contribute to the total variance. If variation introduced by the technician is significant then you will have demonstrated that the different means calculated by different technicians are really at odds with each other. The technique is called one factor analysis of variance because, apart from the error in the measurement process you are looking at how one factor (technician bias) affects the measured means. You should remember that variance is calculated as divided by the degrees of freedom (usually N-1). Because of this division the component variance values are not directly additive but the component values are additive. For this reason you work with initially and then work out the different variance values. Sum of Squares of Deviations from the Mean This value, when calculated by hand or pocket calculator, is easier to work out in a different way; = This second method gives an identical result but doesn't require you to work out the mean in advance and then subtract each datum from it. If you use a computer to work out the value using a built in function you need not worry about how it is done. For this reason the first formula is used throughout this text as it describes the statistic better, i.e. it shows how much your data varies from its own mean. The square makes sure that each individual deviation is positive so low values contribute as much as high values. Finding is one step away from the variance (divide by degrees freedom) and two steps away from the standard deviation (find square root of variance). In some text books this value is called 'sums of squares' for short. Don't be confused by this. Comparing Technicians' Means You need to get four columns of data together on your spreadsheet each containing the sixty measurements made by each technician. There is no need to keep different batches separate for this calculation so simply cut and paste to produce four long columns. Leave a space of ten or more rows at the top of the sheet to allow space for the statistical results. Graphical Method Before you do the statistical analysis (one factor ANOVA) you can use a graphical method to compare technicians. Produce a histogram for each technician's set of data. This involves counting data points that fall in certain ranges and then graphing the frequency totals in a column chart. Fortunately the spreadsheet program you are using has a facility for automating this. What you will have to do though is set up a table of 'bins'. A bin is a category for counting. It is defined by giving the lowest acceptable value for the bin. A datum between that value and the value for the next bin up will be counted. A data point equal to the bin level will also be counted. So, to the side of your data set up a column of figures starting with 0 and increasing in 0.2 steps to beyond the highest measurement you have. Now, select the column of data for your first technician and choose the analysis tools command on the options menu. You will see that the input range box has already been filled in with your data selection. You must now click on the bin range box and enter the range of cells where your column of bin values will be found. (In excel dialogue boxes, instead of typing a cell reference you can click on the spreadsheet to indicate a region - this speeds things up.) Then click on the output box and enter the cell where you want the results to be placed. There are three check boxes on the dialogue box - Pareto and Cumulative Percentage should be switched off but the Chart Output option can be switched on. Click on the O.K. button to obtain the results. You may have to change the title of the chart before you save it to disc. Repeat this for each technician. You now have a graphical way of comparing data sets; can you pick out any set of data as being significantly biased compared to the others? To really be certain you will have to work out the statistics. Numerical Method It will be useful to give names to areas of data before you start. Select all four columns of data and use the appropriate command to name this 'data'. Name each column 'datax', 'datay', 'dataz' and 'dataw'. To start with you want to know the global degrees freedom, mean, sums of squares of deviation from the mean and variance. Use a space to the right of the data to enter formulae for these and label them using neighbouring cells. =COUNT(data)-1 =AVERAGE(data) =DEVSQ(data) For variance you must enter a formula which refers to two cells above and does a division. Next you must start looking at the separate means. Above each column enter a formula of the type; =AVERAGE(datax) Have a look at the means. They will certainly be each different to the overall mean but is this due to the normal error associated with making the measurements or due to the technicians introducing consistent error? This is where the analysis of variance comes in. Imagine that a technician always makes a measurement 2.5 units too high. The normal error will give measurements spread out either side of a mean shifted 2.5 units up instead of either side of the real mean. So if you look at deviation from the technicians own mean you will see the variance due only to the normal measurement error without the extra deviation due to the 2.5 unit shift. So this is how to calculate the error variance; In cells above the four data columns calculate individual sums of squares of deviations from the individual mean, e.g.; =DEVSQ(datax) Somewhere on your spreadsheet add up the four separate values. You now have the contribution to from measurement error alone. You also need to know the contribution due to technician bias. There is more than one way to find this out. Imagine that each of the measurements a technician made were set to the technicians own mean value, i.e. that each technician had made fifty identical measurements. If this were done you would have eliminated variation due to normal error and the variation left would be down to the technicians. You might then repeat the calculation for this block of data. However there is a simpler way to do it without making up a big table of figures. If you calculate for the row of four means you can simply multiply by sixty to get the full figure. So calculate =DEVSQ( insert an appropriate reference to the four means here )*60 You now have three values for You should see that the two contributing values (for error and for technician bias) can be added together to make the total for all the data. So in reality you could choose to work out any two values and find the other by subtraction or addition as appropriate. On the other hand, by calculating all three values you have got a way of checking your calculations. What next? You must find the variance for error and variance for technician. To do this you need to divide by degrees freedom. You have to be careful here. Degrees freedom for the error is ( ??? ) , degrees freedom for technician bias is ( ??? ). Now is the time to compare the two contributory variance values. Divide error variance by technician variance. This value is called f. You have to obtain from this the probability that variation between technicians' means is due to chance alone. Do not look up f in a statistics table. Statistical tables are provided only because the necessary calculation is too complex and time consuming to perform on paper using a pocket calculator. You have a computer to help you and the spreadsheet program is capable of the doing the proper calculation for you! The function must account for both degrees of freedom so it will take the form; =FDIST( reference to your f value, reference to error d.f., reference to tech. d.f. ) This gives you p exactly - it is not an estimate. If the value is low then the null hypothesis (that variation between technicians' means is due to chance) is unlikely. If it is lower than 0.05 then you can confidently discount the null hypothesis. If your value is greater than 0.05 then you must assume that without any further evidence to the contrary your technicians are working reliably. If there is a significant discrepancy between the means what can you do to satisfy the client? Look at the four means and the histograms you produced. Can you spot a single technician who is at odds with rest? If you can't, you will have to carry on regardless, but if there is one technician who's mean is obviously deviating from the other three you can choose to disregard the data from that person. Repeat the ANOVA using the remaining three technicians' data. Are the three means still at odds with each other? Repeat the test of the mean against the stated value on the packaging using an overall mean taken from the three technicians' data. Comparing Batches of Margarine You can use exactly the same technique (one way analysis of variance) to find out if batches of margarine have mean values which are significantly different. Of course you will have to arrange the data in columns differently, so start with a fresh spreadsheet, and remember that the degrees freedom will be different. If you have chosen to discount the data of one of your technicians you will be using a smaller data set for each batch. If the batches are significantly different you will have to compare each batch against the value on the packaging separately instead of the overall mean. Routine use of ANOVA The spreadsheet program you are using has a built in facility for ANOVA. You have set the calculations up manually so that the client could see all the intermediate values but using the facility provided you can get instant results. You must still arrange the data into columns as before, but this time select the entire block and choose the Analysis Tools command on the Options menu. A dialogue box will ask you which analysis you wish to use - choose one factor ANOVA. A second dialogue box will ask you questions about your data and how you want the test done. Use the help button on this box to find out what the options do. Two factor ANOVA In some studies you may choose to treat the two factors affecting the mean together. In other words look at the technician bias at the same time as batch variation. There will now be more than two contributors to variation in the measurements. There is measurement error, technician bias and batch variation. In addition there may also be an interactive variation if a technician introduces a different bias on different batches! This seems very unlikely for our study but is a distinct possibility in less clear cut studies. There is no need to do this type of analysis for your client but you should be aware of the possibility of using it in the future. The Report You need to word process a report for your client. Tell them whether the margarine fits in with the value stated on the packaging. If you had to do this with batches of margarine separately explain why. If you had to ignore data from a technician explain why you did this - your data set will be smaller than that paid for by the client so your reasoning must be well thought out. At all points state the statistical test used and the key statistical results. Quality Assurance Audit Your company has a policy for running quality assurance audits. Your study has been picked out for audit. You must provide all the documents connected with the project, every spreadsheet file and a copy of the report, before sending off to the client. (Using Email may be a bit of strain on the computer system with all these files so check with your course tutor about this first.)