Management Engineering - Quality Data Analysis

Completed notes of the course

QUALITY DATA ANALYSIS

"If you find these notes helpful, please consider making a donation to support my work and help keep this material available for all students. Every contribution, big or small, makes a difference!"
Link: https://gofund.me/f92033eb

SYLLABUS
1. INTRODUCTION
1.1. QUALITY OF DESIGN AND QUALITY OF CONFORMANCE
1.2. MEASUREMENT SYSTEM
2. DATA MODELLING
2.1. MAIN ASSUMPTIONS
2.2. RANDOMNESS AND CORRELATION
2.3. TESTS FOR THE ASSUMPTIONS
2.4. INDEPENDENCE TESTS
2.4.1. RUNS TEST FOR INDEPENDENCE
2.4.2. BARTLETT'S TEST FOR AUTOCORRELATION
2.4.3. LBQ'S TEST FOR AUTOCORRELATION (Ljung-Box-Pierce)
2.5. REMEDIES IN CASE OF NON-INDEPENDENCE
2.5.1. GAPPING
2.5.2. BATCHING
2.6. NORMAL DISTRIBUTION TESTS
2.6.1. CHI-SQUARED TEST FOR ANY DISTRIBUTION
2.6.2. SHAPIRO-WILK
2.6.3. ANDERSON-DARLING
2.7. REMEDIES IN CASE OF NON-NORMALITY
2.7.1. BOX-COX TRANSFORMATION
2.7.2. TEST ON THE EMPIRICAL CUMULATIVE DISTRIBUTION FUNCTION

If the assumptions do not hold, I have to take some corrective action.

2.2. RANDOMNESS AND CORRELATION
In the stationary meandering pattern the overall mean is stable, but the data are still not random -> typical pattern of autocorrelation.
AUTOCORRELATION
− correlation measures the relationship between two variables (for instance weight and height)
− autocorrelation measures the relationship of a variable with lagged values of itself (for instance x_t vs x_{t-k})
Observations:
• Correlation and independence
o If x_t and x_{t-k} are independent → they are uncorrelated (ρ_k = 0 for all k)
▪ The reverse is not true: correlation is a very specific type of dependency (linear dependency), so uncorrelated variables could still be dependent → correlation is a measure of linear dependence, but there are other types of dependence
o If x_t and x_{t-k} are correlated (ρ_k ≠ 0 for some k) → they are dependent
▪ The reverse is not true, because dependency does not always imply autocorrelation
• Correlation and causality: correlation does not necessarily mean that there is a cause-effect relationship between the variables (for instance, ice-cream consumption is correlated with shark attacks in Australia, yet correlation does not imply causation: a third variable, the temperature, explains the correlation between the other two) -> DOE: doing experiments on purpose is the only way to observe causation; by changing something on purpose you can observe the response change.

2.3. TESTS FOR THE ASSUMPTIONS
We can perform different types of tests to decide whether the data are appropriate or must be corrected.

2.4. INDEPENDENCE TESTS
2.4.1. RUNS TEST FOR INDEPENDENCE
Objective: check whether the pattern is random, and hence whether the data are independent. At the beginning we can also visualize the data with a time series plot.
o RUN: sequence of successive equal symbols that precedes a different symbol (example: +++---++-- = 4 runs; the two extreme situations are +-+-+-+-+- = 10 runs and +++++----- = 2 runs)
If the process is random, the number of runs C observed on a large number of samples will be (approximately) normally distributed: C ~ N(E(C), V(C)).
Hypothesis testing:
o H0: the process is random, so the number of runs is random → C ~ N(E(C), V(C)), with m the number of plus signs and (n − m) the number of minus signs:
E(C) = 2m(n − m)/n + 1
V(C) = 2m(n − m)(2m(n − m) − n) / (n²(n − 1))
o If the observed number of runs falls too far from E(C) (standardized value outside ±z_{α/2}), I reject the assumption of a random pattern.

2.4.2. BARTLETT'S TEST FOR AUTOCORRELATION
The real autocorrelation ρ_k is a number (deterministic and unknown), while its estimator is a random variable, since it is a function of random variables (X_t).
1. ρ̂_k: estimator of the autocorrelation at lag k
2. Set the hypotheses:
o H0: ρ_k = 0 → there is no autocorrelation at lag k
o H1: ρ_k ≠ 0 → there is correlation at lag k
3. Calculate the rejection region: |ρ̂_k| ≥ z_{α/2}/√n -> I can write the rejection region with the absolute value because under H0 the mean of the estimator is zero
4. Set α = 0.05 (z_{α/2} ≈ 2) and see whether the value of ρ̂_k falls in the rejection region
Example: 50 is the number of observations. Each row is the correlation between X_t and X_{t−k}; we repeat the Bartlett test for each row, i.e. for each lag up to lag 12.
Pay attention: when you are dealing with multivariate data (a lot of different quality features, a lot of different parameters), the Bartlett test is what you need to keep what is going on under control.
Observation: the test cannot be used for different lags at the same time. ➔ when conducting multiple analyses on the same dependent variable, the chance of committing a Type I error (rejecting H0 even if it is true) increases, thus increasing the likelihood of obtaining a significant result by pure chance → Bonferroni correction (α = 5% is fine only if we are doing one test, not twelve tests as in the example).
Bonferroni inequality: assume we have N hypothesis tests (i = 1, 2, …, N) (N is, for instance, the overall number of windows in a house with an alarm).
• Each test has its own probability of rejecting H0i when it is true → α_i
• The family-wise type I error is α′ ➔ it is the probability of rejecting at least one null hypothesis when they are all true:
➔ for independent tests it can be shown that α′ = 1 − ∏(1 − α_i), where (1 − α_i) is the probability of no false alarm in test i, so (1 − α′) = P(no false alarm in N tests)
• If we set the same α for all the tests (α_i = α ∀i) → α′ = 1 − (1 − α)^N → α = 1 − (1 − α′)^(1/N)
➔ we can so build intervals that constrain the family error rate:
o Choose the nominal family error rate α′_nom
o How to use Bonferroni's inequality in a practical way:
− set α′_nom = SUM(α_i);
− then the real α′ < α′_nom
− and define the single α_i -> α_i = α′_nom / N (!!! each window has a low sensitivity, since α_i is low and so the beta error is huge)
Bartlett's test for more lags: if we have L different lags (k = 1…L) we set α_k = α_fam / L
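The runs test and the Bartlett-type limits above can be tried quickly on a dataset. The following is an illustrative Python sketch (not part of the original course material), using synthetic data and the normal approximation described above; the function names are arbitrary.

```python
import numpy as np
from scipy import stats

def runs_test(x):
    """Runs test above/below the median, normal approximation as in the notes."""
    x = np.asarray(x, dtype=float)
    signs = x > np.median(x)                      # True = "+", False = "-"
    n = len(signs)
    m = int(signs.sum())                          # number of "+"
    c = 1 + int(np.sum(signs[1:] != signs[:-1]))  # observed number of runs C
    e_c = 2 * m * (n - m) / n + 1
    v_c = 2 * m * (n - m) * (2 * m * (n - m) - n) / (n**2 * (n - 1))
    z0 = (c - e_c) / np.sqrt(v_c)
    return c, z0, 2 * stats.norm.sf(abs(z0))      # reject randomness if p-value < alpha

def bartlett_limits(x, n_lags=12, alpha_fam=0.05):
    """Sample autocorrelations rho_k with Bonferroni-corrected Bartlett limits."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    rho = np.array([np.sum(x[k:] * x[:-k]) / np.sum(x**2) for k in range(1, n_lags + 1)])
    limit = stats.norm.ppf(1 - (alpha_fam / n_lags) / 2) / np.sqrt(n)  # alpha_k = alpha_fam / L
    return rho, limit, np.abs(rho) > limit        # True -> reject H0: rho_k = 0

rng = np.random.default_rng(1)
x = rng.normal(size=50)                           # synthetic data, iid under H0
print(runs_test(x))
print(bartlett_limits(x))
```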
2.4.3. LBQ'S TEST FOR AUTOCORRELATION (Ljung-Box-Pierce)
Objective: check whether there is autocorrelation in the data.
Concept: Q is the test statistic and n is the number of observations:
Q = n(n + 2) Σ_{k=1}^{L} ρ̂_k² / (n − k)
If at least one autocorrelation coefficient is different from 0, Q is large -> for this reason α is not split symmetrically (one side only) and the rejection region is on the right.
Steps:
1. Set the hypotheses
o H0: ρ_k = 0 for k = 1…L → there is no autocorrelation (with this test I check all the lags in one shot!); if H0 is true -> Q ~ χ²_L
o H1: ∃k ∈ [1; L] s.t. ρ_k ≠ 0 → there is correlation at least at one lag
2. Calculate the rejection region: Q > χ²_{α,L} → reject H0 only if Q falls to the right of χ²_{α,L}.
Example:
Link between Bartlett's test and LBQ: is there any connection between the test statistics of the Bartlett test applied at lags 1, 2, 3…L and the LBQ test statistic? In real life I should always try as many of the methods I know as possible -> never rely on one method/test only.
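A minimal sketch of the LBQ statistic, assuming the standard Ljung-Box form given above; the data are synthetic and the helper name lbq_test is arbitrary.

```python
import numpy as np
from scipy import stats

def lbq_test(x, L=12):
    """Ljung-Box Q = n(n+2) * sum_{k=1..L} rho_k^2 / (n-k), compared with chi2 with L d.o.f."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    rho = np.array([np.sum(x[k:] * x[:-k]) / np.sum(x**2) for k in range(1, L + 1)])
    q = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, L + 1)))
    p_value = stats.chi2.sf(q, df=L)              # one-sided: rejection region on the right
    return q, p_value

rng = np.random.default_rng(1)
iid = rng.normal(size=200)
ar = np.zeros(200)                                # autocorrelated series for comparison
for t in range(1, 200):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()
print(lbq_test(iid))                              # large p-value -> do not reject H0
print(lbq_test(ar))                               # small p-value -> autocorrelation detected
```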
2.5. REMEDIES IN CASE OF NON-INDEPENDENCE
Assume that the runs test rejected the hypothesis of a random pattern in the data: what can we do?
2.5.1 GAPPING
One way to get rid of the autocorrelation is what is called gapping.
Gapping: reducing the sampling frequency (subsampling) → if I have autocorrelation between x_t and x_{t−k} (k = lag), I can subsample my dataset by keeping one observation every k. We move from the full dataset to a subset, and usually we can break the autocorrelation and make the dataset random.
Example: we have x_1, x_2, x_3, …, x_n with high autocorrelation at lag 10; we take one observation out of every 10 → the new dataset will be x_10, x_20, x_30, … x_n.
Observation: we risk losing some information and losing normality, because we are reducing the dataset (the central limit theorem may no longer apply).

2.5.2. BATCHING
Batching: to remove autocorrelation we can divide the dataset into sequential, non-overlapping batches and for each of them take the sample mean -> I get j batches of b observations each.
x̄ ~ ?(µ, σ²/b): this is true only if X_i is iid with mean µ and standard deviation σ. Furthermore, if the number of data in each batch is large, I can apply the central limit theorem and the "?" becomes N (normal).
Example, chemical process: I have 1000 observations that I batch in groups of 10 → I get 100 batches of b = 10 observations each. In the new dataset I will have 100 values corresponding to the sample mean of each batch. We can see from the autocorrelation plot that the autocorrelation has been removed.
Observations:
− With this method I can get rid of both the non-randomness and the non-normality of the data, because we can apply the central limit theorem to the batched dataset
− Disadvantage: difficulty in defining the appropriate value of b (the batch size, the window) → empirical approaches
Empirical (iterative) approach to determine the batch size:
1. Initialise b = 1
2. Compute the autocorrelation coefficient at the first lag
3. If the coefficient is smaller than 0.1 go to step 5; else go to step 4
4. Set b = 2·b and go to step 2
5. End
Observations on batching and gapping:
− Both approaches are applicable only to stationary processes (constant mean) -> if the mean is not stable (non-stationary, there is a trend), batching and gapping no longer remove the non-randomness.
− Both approaches induce a loss of information.
− These approaches do not tackle the autocorrelation issue: we simply find a way to avoid it instead of dealing with it. How can we "identify" an appropriate model in case of non-random data?
o Regression
o ARIMA
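A sketch of gapping and of the iterative batch-size rule described above, on synthetic autocorrelated data (variable and function names are illustrative only):

```python
import numpy as np

def lag1_autocorr(x):
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.sum(x[1:] * x[:-1]) / np.sum(x**2)

def gapping(x, k):
    """Keep one observation every k (subsampling)."""
    return np.asarray(x)[::k]

def batching(x, threshold=0.1):
    """Double the batch size b until the lag-1 autocorrelation of the batch means
    falls below the threshold (the empirical rule in the notes)."""
    x = np.asarray(x, dtype=float)
    b = 1
    while True:
        n_batches = len(x) // b
        means = x[: n_batches * b].reshape(n_batches, b).mean(axis=1)
        if abs(lag1_autocorr(means)) < threshold or n_batches < 10:
            return b, means
        b *= 2

rng = np.random.default_rng(2)
x = np.zeros(2000)
for t in range(1, len(x)):                        # autocorrelated, stationary data
    x[t] = 0.9 * x[t - 1] + rng.normal()
b, means = batching(x)
print(b, lag1_autocorr(x), lag1_autocorr(means))  # the batch means are nearly uncorrelated
```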
2.6. NORMAL DISTRIBUTION TESTS
The first thing to do is always to use graphical tools (plot the data) to see what type of distribution the data follow:
− Histogram → is it symmetric?
− Boxplot → to check whether the distribution is symmetric
➔ this is not an assurance of normality, it could simply be a symmetric distribution → we need quantitative tests.
Goodness of fit tests: the goodness of fit (GOF) tests measure the agreement of a random sample with a theoretical probability distribution function → we can run a test for each type of distribution we think it might be.
➔ the procedure consists of defining a test statistic (i.e., a random variable calculated from the sample data to decide whether to reject the null hypothesis → the test statistic compares your data with what is expected under the null hypothesis).
We'll see:
• Chi-squared test -> we will never use it, because it is a very rough test and it requires discretizing the data.
• Shapiro-Wilk
• Anderson-Darling

2.6.1. CHI-SQUARED TEST FOR ANY DISTRIBUTION
Objective: determine whether the data follow a certain distribution → very flexible.
Concept: compare, bin by bin, the height of the bin of our empirical distribution with the one that comes from the model: the more similar the two heights are, the better the fit approximates the histogram → this test is applied to binned data with k = number of bins (k = 1 + log2 n).
STEPS:
1. Define:
o O_i = observed frequency in class i (each class is a bin)
o E_i = expected frequency in class i ➔ the probability that our variable X falls within the class limits (probability of being lower than the upper limit of bin i minus probability of being lower than the lower limit), times the sample size:
E_i = n·(P(X ≤ x_i,up) − P(X ≤ x_i,low)) = n·(F(x_i,up) − F(x_i,low)) = expected number of occurrences inside the i-th bin
with:
▪ n = sample size
▪ x_i,up = upper limit of the i-th class
▪ x_i,low = lower limit of the i-th class
2. Test statistic: χ²_0 = Σ_{i=1}^{k} (O_i − E_i)² / E_i
3. Set the hypotheses:
o H0: the data follow the given distribution F (F can be whatever distribution we want); under H0, χ²_0 ~ χ²_{k−c}, with c = number of estimated parameters (for a normal distribution c = 2)
o H1: the data do not follow the F distribution
4. Calculate the rejection region: χ²_0 > χ²_{α, k−c}
➔ graphically: it compares binned data (as in a histogram) to the curve of the distribution.

2.6.2. SHAPIRO-WILK -> in practice we never apply this formula by hand; we run the test in the software
The Shapiro-Wilk test calculates a W statistic that tests whether a random sample x_1, x_2, …, x_n (my dataset) comes from (specifically) a normal distribution. Small values of W are evidence of departure from normality; percentage points for the W statistic, obtained via Monte Carlo simulations, were reproduced by Pearson and Hartley. The W statistic is calculated as:
W = (Σ_{i=1}^{n} a_i x_(i))² / Σ_{i=1}^{n} (x_i − x̄)²
where the x_(i) are the ordered sample values (x_(1) is the smallest) and the a_i are constants generated from the means, variances and covariances of the order statistics of a sample of size n from a normal distribution.

2.6.3. ANDERSON-DARLING

2.7. REMEDIES IN CASE OF NON-NORMALITY
2.7.1. BOX-COX TRANSFORMATION
λ ≠ 0 → y(λ) = x^λ: we just have to find λ, and for this we use the Box-Cox plot.
➔ if we get λ = 0, this means that we use y(λ) = ln x.
λ is typically in [−1; 2].
What the Box-Cox plot does is to try to find λ such that the transformed sample is as close as possible to a sample coming from a normal distribution. Then, thanks to λ, we make the transformation.
Example: in this example, if we use λ …
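If Python with SciPy is available, the quantitative normality checks and the search for the Box-Cox λ can be run as in this illustrative sketch (synthetic, strictly positive data assumed; not the course's software):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=0.6, size=100)   # clearly non-normal, positive data

print(stats.shapiro(x))                            # Shapiro-Wilk: W statistic and p-value
ad = stats.anderson(x, dist="norm")                # Anderson-Darling for normality
print(ad.statistic, ad.critical_values)

# Box-Cox: scipy searches the lambda that makes the sample as close as possible to normal
y, lam = stats.boxcox(x)                           # requires strictly positive data
print(lam)                                         # lambda close to 0 -> roughly a log transform
print(stats.shapiro(y))                            # re-test the transformed data
```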
3. LINEAR REGRESSION MODELS
Model: Y_t = µ_t + ε_t
− µ_t is deterministic but unknown (the mean does not remain constant anymore)
− ε_t ~ iid(0, σ²_ε) → we are not assuming normality, but the error is independent and therefore not autocorrelated (iid = independent and identically distributed)
There are tools that let us analyse the dataset even though the mean is not constant, and make predictions.
OBJECTIVE: the criterion is to try to minimize the error; we use the SSE. The SSE is the sum of the squared differences between the observed data Y_t and the real value of the mean at instant t:
SSE = Σ_t (Y_t − µ_t)²
− Y_t → observed data
− µ_t → unknown but deterministic mean; it depends on x_t, which is deterministic and known
− The power of 2 is there because we want to avoid the effect of the sign; it would also be possible to take the absolute value.
STEPS:
1. µ_t is usually unknown and we want to estimate it. In order to identify the model we have to assume the model "structure" for µ_t. "Linear regression model" does not mean that you are fitting a straight line to the data: it means that the model I am assuming is linear with respect to the unknown parameters.
Examples:
Ex non-linear: µ_t = a·t + b·t^c -> this cannot be linearized, because I would have ln(a·t + b·t^c) = ln(µ_t); if instead it were µ_t = b·t^c → ln(µ_t) = ln b + c·ln t (linear) -> I can call ln b = b′ for simplicity.
2. Estimate the parameters of the model by minimizing the SSE (least-squares approach). Assume that the structure is given; now we have to estimate the coefficients.
3. Check the assumptions and the model correctness. We fit the model and then we check it; if, in checking the model, we find that some of the assumptions are not valid, we go back and try something different (if the assumptions on the residuals are not verified, we change the model).

3.1. SIMPLE LINEAR REGRESSION
Simple linear regression is a linear function of one single regressor (x_t = x); the number of regressors is p, so in simple linear regression p = 1.
PARAMETERS OF THE MODEL: the parameters are β0, β1 → estimators: β̂0 = b0, β̂1 = b1
➔ we have two cases for the simple linear regression.

3.1.1. CONSTANT MEAN: Y_t = β0 + ε_t
It is a general model, where µ_t = µ -> assume the true model is µ_t = β0.
Since we have a constant mean and just one regressor: Y_t = µ + ε_t = β0 + ε_t → estimated model ŷ = b0.
The problem is that I don't know β0, and I want to apply the idea of minimizing the SSE in order to compute the estimate of β0 (b0). I use the derivative, since I want to find the minimum -> b0 is the value that minimizes the SSE (it turns out to be the sample mean).
-> if there is a trend in the mean, this model does a bad job and we need another model.

3.1.2. LINEAR TREND: Y_t = β0 + β1·x_t + ε_t
Remember: the first step is to assume the model structure. Since we have a linear trend and only one regressor: Y_t = µ_t + ε_t = β0 + β1·x_t + ε_t → estimated model ŷ_t = b0 + b1·x_t.
x_t is the regressor or predictor; in the example it is just the time, x_t = t.
From different samples I obtain different lines, and so different values of b0 and b1 -> b0 and b1 are random variables. b0 is the intercept; b1 tells me how steep the line is.
ŷ_t = b0 + b1·x_t is what I estimate from my data; it is useful only over a defined range of data, so pay attention to prediction and extrapolation. The further I move from the range of my data, the more the uncertainty increases.
To find b0 and b1 we have to minimize the SSE:
1) we drop the −2 because we have 0 on the right-hand side of the equation;
2) I substitute the expression of b0 found before into this equation.
The result is:
b1 = Σ_t (x_t − x̄)(Y_t − Ȳ) / Σ_t (x_t − x̄)²,  b0 = Ȳ − b1·x̄
Nobody knows the reality! There may be some discrepancy between the real value of the intercept and the one I am estimating; the same holds for the slope.
We can do two things with our model:
− interpolation: I compute the value ŷ_t between two observations
− extrapolation: the expected value of the response outside the interval where the data have been observed -> we feel more comfortable with interpolation, but it is usually less useful.
Pay attention: if the true model is equal to the assumed one, we know that:
− the estimators are unbiased: E(b0) = β0, E(b1) = β1
− they are minimum-variance estimators (among all the unbiased estimators)
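A small numeric sketch of the linear-trend case, using the closed-form least-squares estimates above on simulated data (the true coefficients 2.0 and 0.3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(1.0, 51.0)                           # regressor: time, x_t = t
y = 2.0 + 0.3 * t + rng.normal(scale=1.0, size=t.size)

# closed-form least-squares estimates that minimise SSE = sum_t (y_t - b0 - b1*x_t)^2
s_xx = np.sum((t - t.mean()) ** 2)
s_xy = np.sum((t - t.mean()) * (y - y.mean()))
b1 = s_xy / s_xx
b0 = y.mean() - b1 * t.mean()
y_hat = b0 + b1 * t
sse = np.sum((y - y_hat) ** 2)
print(round(b0, 3), round(b1, 3), round(sse, 2))
```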
3.2. MULTIPLE LINEAR REGRESSION
I have k regressors: x_t = [x_1t, x_2t, …, x_kt]. x_1t can be equal to t and x_2t equal to t², and the model is still linear, because linearity refers to the parameters (β1, …, βk).
− p = number of regressors; in this case k = p
− n = number of observed data (i = 1…n)
− K = p + 1 (we add 1 because we have the intercept β0, which is not multiplied by a regressor)
General model: Y_t = µ_t + ε_t = β0 + β1·x_1t + β2·x_2t + … + βk·x_kt + ε_t → we use a matrix notation (the "design matrix").
In the matrix, for simplicity, the index t of the regressors is moved to the first position (t ranges from 1 to n). For each instant t we have k regressors, and we have n instants of time. y is the vector of observed data.
Estimated model: ŷ = Xb
PARAMETERS OF THE MODEL: in this case we have a vector of parameters β = [β0, β1, …, βk] → estimator: β̂ = (X′X)⁻¹X′y
• (X′X)⁻¹ exists (it is invertible) only if the regressors (columns of X) are linearly independent: no column of X is a linear combination of the other columns.
Example of regressors that are not linearly independent:
• The estimators are unbiased, E(β̂) = β, with var(β̂) = σ²(X′X)⁻¹; all the noise terms observed at different t are independent and uncorrelated (no covariance). So I know everything about the distribution of the estimators β̂.

3.3. TEST OF HYPOTHESIS
When I have to do a linear regression I have to choose the model that fits the distribution of the observed variable Y_t → to do so we make hypotheses on the values of the parameters of the model.
-> in TEST 1 I have one test for each parameter.
-> if in TEST 2 I accept H1 (the alternative hypothesis), I then ask myself which are the useful parameters.

3.3.1. TEST 1
HYPOTHESES: if I do not reject H0 (I accept H0), I can say that the regressor does not influence the response and so x_jt is not relevant to explain Y_t; if we reject H0, the regressor influences the response!
IN CASE OF SIMPLE LINEAR REGRESSION, TEST STATISTIC: t0 = b1 / se(b1). If |t0| is larger than t_{α/2, n−K} we reject H0; equivalently, if the p-value is smaller than α I reject H0.
Example, Deming data: -> I can reject the null hypothesis, since the p-value < α (0.05), so the slope is not equal to zero and we can say that the time (regressor) influences the response. b1 is very close to zero, but we have proved that there is statistical evidence that β1 ≠ 0; this is possible because my observations y_t are also close to zero (I could multiply all the measures by 1000 and the data would look better).
TYPICAL OUTPUT OF A LINEAR MODEL: we never trust a point estimate without the uncertainty attached to it.
DF: degrees of freedom. 1 = p = number of regressors in my model; 48 = n − K = n − (p + 1); 49 = n − 1.
If I have 5 parameters but 1 is insignificant, I can throw out that parameter, but then I have to repeat the test with the other 4 variables! -> if I have 2 insignificant parameters, I can eliminate at most 1 parameter at a time.
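The matrix formulas β̂ = (X′X)⁻¹X′y and var(β̂) = σ²(X′X)⁻¹, together with TEST 1 on each coefficient, can be sketched as follows (synthetic data; the t² regressor is deliberately useless, so its p-value should be large):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 50
t = np.arange(1.0, n + 1)
X = np.column_stack([np.ones(n), t, t**2])         # design matrix: intercept, t, t^2 (K = p + 1 = 3)
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)   # true beta2 = 0

K = X.shape[1]
xtx_inv = np.linalg.inv(X.T @ X)
b = xtx_inv @ X.T @ y                              # b = (X'X)^-1 X'y
resid = y - X @ b
s2 = resid @ resid / (n - K)                       # estimate of sigma^2_eps
se_b = np.sqrt(s2 * np.diag(xtx_inv))              # from var(b) = sigma^2 (X'X)^-1
t0 = b / se_b                                      # TEST 1: one t statistic per coefficient
p_values = 2 * stats.t.sf(np.abs(t0), df=n - K)
print(np.round(b, 3), np.round(p_values, 4))       # the t^2 coefficient should be insignificant
```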
3.3.2. TEST 2
Assume a multiple linear regression model: Y_t = µ_t + ε_t = β0 + β1·x_1t + β2·x_2t + … + βp·x_pt + ε_t
I can make a hypothesis test to evaluate each of these coefficients -> β1 = 0?; β2 = 0?; …; βp = 0?
For each βi I can run a test: -> a test for one specific coefficient, one at a time (similar in concept to Bartlett's test).
But I am also interested in a test on all the βi at the same time, so I need to change the null hypothesis: -> we are trying to establish whether the observed variable is related to at least one of the regressors. The null hypothesis says that the model is a constant-mean model.
-> this test is based on ANOVA (analysis of variance)
• When SSR is high with respect to SSE, the regression is important, is meaningful -> so I reject H0
• When SSR is low and similar to SSE, we are led to accept H0
Test statistic (ratio of mean squares):
F0 = MSR / MSE = (SSR/p) / (SSE/(n − K))
If H0 is true (β1 = 0; β2 = 0; …; βp = 0), F0 follows an F distribution with p and n − K degrees of freedom.
If the value of F0 is large, we reject the null hypothesis! A large F0 means SSR >> SSE, so the slope is relevant and we probably reject H0. If H0 is false, we have a model with a great difference between ŷ_t and ȳ → SSR will be very large. If F0 is small, I am in the opposite case, in which we can just take the overall mean as a very good predictor/fitted value for all the data, because the slope is quite insignificant. If H0 is true, we have a model where ŷ_t will be very close to ȳ → SSR will be close to zero.
OBSERVATIONS: in case we have one single coefficient, the two tests are identical.
Hypotheses:
• H0: β1 = 0
• H1: β1 ≠ 0
Test 1 and test 2 are based on one assumption: that the noise ε_i ~ N(0, σ²_ε) and is iid. We need to check this assumption, because otherwise it is not true that t0 is distributed as a t-Student, and in the ANOVA test it is not appropriate to use the F distribution for the test statistic -> we check the assumption after the test is performed → we check the residuals.

3.3.3. R², R²adj
R-SQUARED: it is a measure of the percentage of the variability observed in the data that is explained by the estimated regression model:
R² = SSR / SST = 1 − SSE / SST
If SSR is large, the linear model is useful to describe the data variability. If SSR is small, the model is inadequate to represent the observed variable Y_t.
R² is a random variable, because it is a function of random variables → we have no way to define a threshold to "accept" the model.
Pay attention: R² always increases if we add more regressors to the model, so I cannot judge from R² alone whether a regressor is useful or not -> we always use the hypothesis test to judge. If we just look at the value of R² we risk overfitting! Adding regressors over and over, the SSE keeps going down even if the new regressors are meaningless; for this reason we should not look at R² but at R²adj:
R²adj = 1 − [SSE/(n − K)] / [SST/(n − 1)]
-> if we are comparing models with the same number of regressors we can look at R-squared; in the other cases it is better to look at the adjusted R-squared.
-> if I add regressors that are meaningless, R²adj goes down.
Example:
Attention to influential data points:

3.3.4. LACK OF FIT TEST
For each value of the regressors we have more than one replicate. -> this is something we will face very rarely, because most of the time for us the regressor is the time (t), so we have only one value for each time instant. We can reach this condition by grouping, for instance, the data of each day. To check whether the model is correct or not we can look at the lack of fit test. It can also tell us something about the structure of the model we are assuming.
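An illustrative check of TEST 2 and of R² vs R²adj using statsmodels (assumed available; not the course's software). Note that statsmodels names the residual sum of squares `ssr` and the regression (explained) sum of squares `ess`, which is the opposite of the SSR/SSE naming used in these notes:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
t = np.arange(1.0, 51.0)
y = 1.0 + 0.5 * t + rng.normal(size=t.size)

X = sm.add_constant(np.column_stack([t, t**2]))    # intercept + 2 regressors (p = 2)
fit = sm.OLS(y, X).fit()

print(fit.fvalue, fit.f_pvalue)                    # TEST 2: overall ANOVA F test
print(fit.rsquared, fit.rsquared_adj)              # R^2 vs adjusted R^2
ssr_reg, sse = fit.ess, fit.ssr                    # SSR (regression) and SSE (residual) of the notes
print(ssr_reg, sse, ssr_reg + sse, fit.centered_tss)
```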
3.4. CONFIDENCE INTERVAL AND PREDICTION INTERVAL
What am I going to observe next, in the future? x0 is a point where I am not observing anything; it is, for instance, the next time instant (x0 can also be inside the initial time window).
What is the best point estimate? I go with the straight line -> µ̂0 = b0 + b1·x0
CONFIDENCE INTERVAL OF THE MEAN: the confidence level is 1 − α; I can identify a band where I expect, with probability 1 − α, the true mean response to lie. The real µ0 ∈ [Low, Up].
-> If I move x0 very far from x̄, the interval increases in width. I can also move into the past: if x0 in the past is close to the observed data, the uncertainty is lower than when I take x0 far from the data. So, as x0 moves away from x̄ -> the amplitude of the confidence interval goes up. The confidence interval is narrowest at x̄ and its amplitude grows moving away from x̄: when I move far from where I have been observing data, my confidence interval increases in size.
PREDICTION INTERVAL
The question I am asking now is: having observed all my data in the time window and chosen a new time instant (I can obviously do the same exercise inside the interval where I have been observing, maybe between two observations), what is the interval where I can expect to observe the new y with a given probability?
The structure of the prediction interval is very similar to that of the confidence interval; the additional 1 under the square root is the only thing that changes (for simple linear regression the half-width is proportional to √(1/n + (x0 − x̄)²/Sxx) for the confidence interval and to √(1 + 1/n + (x0 − x̄)²/Sxx) for the prediction interval). The prediction interval is much wider because of the additional σ²_ε, which represents the variability of the data I have been observing. Most of the time the points we have been observing fall inside the prediction interval band, because this band is exactly supposed to represent where we expect, with high probability, to observe the data.
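A short sketch of the two intervals with statsmodels (assumed available): get_prediction returns both the confidence band for the mean response and the wider prediction band for a new observation; the evaluation points below are arbitrary.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
t = np.arange(1.0, 51.0)
y = 1.0 + 0.5 * t + rng.normal(size=t.size)
fit = sm.OLS(y, sm.add_constant(t)).fit()

t_new = np.array([25.0, 51.0, 80.0])               # inside the range, next instant, far extrapolation
pred = fit.get_prediction(sm.add_constant(t_new)).summary_frame(alpha=0.05)
# mean_ci_* -> confidence interval for the mean response; obs_ci_* -> prediction interval (wider)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```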
3.5. REGRESSORS SELECTION
Assume I am observing a dataset -> y_t (data) and x_t1 (regressor), t = 1…n, but I am not totally sure whether to consider additional regressors (x_t2 = t², x_t3 = t³, x_t4 = t⁴). Is there any way of deciding whether I need just a simple trend, or should also include a quadratic pattern, or even some sort of exponential decay? So, for instance, I can have one of these models: -> what is the best solution? We use a heuristic solution called stepwise selection! Heuristic means that the method does not provide the best model overall, but it is a good compromise to solve the problem in a simple way (we can also look at R²adj).

3.5.1. FORWARD SELECTION
Forward selection is a sequential procedure where one variable is added at a time → at each step, the variable that provides the best contribution to the "fitting" is selected, and once a variable is added it cannot be removed in the subsequent steps.
Example: I can start by selecting one among these three candidates (each model using one regressor) → a regressor should be included if |t0| > t_{α/2} (test 1). If more than one regressor satisfies the condition, I select the one that provides the best contribution to the "fitting", i.e. the regressor with the highest |t0|. Then, once I have decided that the best model out of the three is Y_t = β0 + β1·t, we check whether we need to add another regressor; and once I have added the second regressor, I check whether I need to add the third one.

3.5.2. BACKWARD SELECTION
We start from a model that contains all the regressors we suppose to be relevant, and at each step we go backward by deciding which regressor to remove. At each step we remove the variable that is "least useful" to explain the data variability (lowest |t0|). Once a regressor is removed, it is never re-included in the following steps.

3.5.3. STEPWISE SELECTION
It combines forward and backward selection. We start as a forward selection, but each time a variable is added, a backward step is carried out to check whether a variable has to be removed. The procedure stops when no regressor has to be included in the model and no regressor has to be removed.
-> we need to specify an alpha-to-enter and an alpha-to-remove (usually equal).
OBSERVATION: the p-value needs to be lower than α-to-enter. Just by adding 1/t we have changed the p-value of the previous regressor t: we can observe that the p-value of the first regressor (t) has changed with respect to the previous step -> useful information for the next step.
STEP 2b, BACKWARD ELIMINATION: we check whether one of the regressors in the chosen model has a p-value > α-to-remove, and we remove the regressor with the highest such p-value. -> In this example all the p-values of the regressors are lower than α-to-remove (0.003 and 0.006), so none of them should be removed.
STEP 3a, FORWARD SELECTION: we choose one of the regressors not used yet (t²) and check whether its p-value is higher or lower than α-to-enter. -> The p-value of t² is higher than α-to-enter, so it should not be included in the model; we don't add t² (because we cannot reject H0: βi = 0).
FINAL MODEL: we can now compute the final regression model (once we have selected all and only the relevant regressors).
OBSERVATION: since the p-value is > 0.05, we cannot reject the null hypothesis, and so the e_t are not autocorrelated.
- H0: ρ = 0
- H1: ρ ≠ 0
Assume α = 10% for the normality test -> in this case we should reject the assumption of normality. The way to deal with non-normal data is to transform the original data, finding the best transformation of the original data.
Key point: if one of the assumptions on the residuals fails, we should not transform the residuals themselves; we should go back to the original data and try to find a transformation of the original data. We will see with this example that going back and transforming the medication error (finding its best transformation) can even change the regression model, and so even the regressors we need to use.
The p-value of the normality test on the residuals (with the model that uses only t as regressor) is 0.933 -> we cannot reject the assumption that the data are normally distributed.

3.6. OTHER CONSIDERATIONS IN THE REGRESSION MODEL
3.6.1. QUALITATIVE PREDICTORS
Linear models, and especially least-squares regression, can use quantitative regressors (like t, t², 1/t), but we may also want to include qualitative regressors (also called categorical predictors or factor variables), i.e. variables that are not continuous.
Examples of qualitative predictors that can influence credit card data: gender, student, status, ethnicity. x_i is a regressor that represents whether the data come from a male or a female; it can assume the value 0 or 1. -> at the same time we are fitting two different models.
0.6690 -> the regressor is not significant; we accept H0: βi = 0, so we do not need to introduce this regressor and there is statistical evidence that there is no difference between the credit balance of males and females.

3.6.2. QUALITATIVE PREDICTORS WITH MORE THAN TWO LEVELS
A qualitative predictor with more than two levels is handled with dummy variables. In the credit card case, an example is ethnicity, which has 3 possible categories/levels (Caucasian, Asian, African-American (AA)). If I have three possible categories I need to introduce 2 dummy variables!
-> #DummyVariables = #Categories_of_the_variable − 1
-> I do not introduce x_i3 (an additional regressor) because, if x_i1 and x_i2 are both equal to zero, the person can only be AA; x_i3 would be a linear combination of the other two, and for this reason I don't introduce it.
One significant use of dummy variables is for seasonality.
HOW TO MODEL A RESPONSE THAT DEPENDS ON THE DAY OF THE WEEK?
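A hypothetical mini-example of a 3-level qualitative predictor encoded with 2 dummy variables (the balance numbers are invented; pandas and statsmodels are assumed available):

```python
import pandas as pd
import statsmodels.api as sm

# hypothetical credit-balance data with a 3-level qualitative predictor (ethnicity)
df = pd.DataFrame({
    "balance": [520.0, 480.0, 610.0, 495.0, 530.0, 605.0, 500.0, 470.0],
    "ethnicity": ["Caucasian", "Asian", "AA", "Caucasian", "Asian", "AA", "Caucasian", "AA"],
})
dummies = pd.get_dummies(df["ethnicity"], drop_first=True, dtype=float)  # 2 dummies encode 3 levels
X = sm.add_constant(dummies)
fit = sm.OLS(df["balance"], X).fit()
print(fit.params)        # the intercept is the mean of the dropped (baseline) level
print(fit.pvalues)       # TEST 1 on each dummy: does the level differ from the baseline?
```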
3.6.3. INTERACTIONS
Sometimes I can even be interested in a sort of non-linear effect called interaction: if something happens to one variable, it affects what is going to happen with the other one.
Example: -> never consider interactions among the dummy variables (x_i3 · x_i2 -> no!)
HOW TO MODEL AUTOCORRELATED PROCESSES?
Autocorrelation with a linear model (with some "tricks"): AUTOREGRESSIVE model: it looks like a linear model, it is very similar. Do you see any specific problem in this case? Y_{t−1} is acting like a regressor, but Y_{t−1} is a random variable, while in our previous cases the regressor was deterministic. The trick is to assume …

3.7. GENERAL MODELS
Linear models are really flexible tools: the key point is that, in principle, I can represent my observed data as a function of some regressors (f(x)) plus a noise term.
3.7.1. CUBIC SPLINE
Given Y = f(x) + ε -> knots are points in time -> I can choose them equally spaced in the interval, or I can put the knots in the locations where the points are "nervous" (where the curve changes quickly).
-> a cubic spline can be re-written as a linear model in all the coefficients that I need to estimate.
APPLICATION: in statistics we use these functions to create a "regression" on the small pieces of the curve, step by step.
Piecewise constant: assume that each segment between two knots has a constant predictor.
Piecewise linear: not just a constant in each piece where I am doing the fitting, but also a slope.
Continuous piecewise linear: I want to avoid the discontinuity, so I impose continuity.
I can ask a little bit more: in each segment I assume a cubic polynomial (order 3), with continuity at the knots of the first derivative (the slope) and of the second derivative (the curvature).
3.7.2. CROSS-VALIDATION
One very powerful idea to compare different models (for instance: cubic spline, polynomial fitting, cubic splines with different locations of the knots) is to use cross-validation.

4. ARIMA MODEL
The value of p depends on how many φ_i are different from zero (the same holds for q with the θ_i).
The ARIMA model is a very generic way of describing how the data at time t can be affected by all the previous data observed up to time t − p and by all the noise terms observed in the past up to time t − q.
OBSERVATION: E(x_t) = µ (for every t)
Another notation:
Theorem: the AR(p) process is stationary if and only if the polynomial A(B) is stable, which means that all its roots lie strictly outside the unit circle in the complex plane (we don't do the proof).
-> if the roots are outside, I have stability of the autoregressive process.
Case AR(1): one usual way of dealing with AR processes (and with ARIMA in general) is some sort of recursive substitution:
OBSERVATIONS:
• Assume |φ1| < 1: the weight of ε_{t−i} (φ1^i) decreases as i increases, so the noise term ε_{t−i} loses influence as its "age" increases. If φ1 is larger than 1, the opposite happens: the weight of ε_{t−i} (φ1^i) increases as i increases, meaning that the noise term ε_{t−i} increases its influence as its "age" increases.
• Assume φ1 = 1 -> Random Walk
RANDOM WALK: it is a particular type of non-stationary AR(1) process, and it happens when φ1 = 1 -> this briefly introduces the non-stable (non-stationary) autoregressive model, which is what is called ARIMA(0,1,0).
Where:
o AR refers to the stationary autoregressive part
o I refers to the non-stationary (integrated) part -> the roots of A(B) are on the unit circle in the complex plane
o MA is the moving average part
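A tiny simulation contrasting a stationary AR(1) (|φ1| < 1) with the random walk (φ1 = 1); the φ values and the seed are arbitrary:

```python
import numpy as np

def ar1(phi, eps):
    """Simulate X_t = phi * X_{t-1} + eps_t starting from zero."""
    x = np.zeros(len(eps))
    for t in range(1, len(eps)):
        x[t] = phi * x[t - 1] + eps[t]
    return x

rng = np.random.default_rng(8)
eps = rng.normal(size=1000)
x_stationary = ar1(0.7, eps)     # |phi1| < 1: old shocks fade out, variance stays bounded
x_random_walk = ar1(1.0, eps)    # phi1 = 1: random walk, ARIMA(0,1,0), variance grows with t
print(np.var(x_stationary[:500]), np.var(x_stationary[500:]))
print(np.var(x_random_walk[:500]), np.var(x_random_walk[500:]))   # typically much larger later on
```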
4.2. AR(p)
REGRESSORS: the regressors of this model are the lagged values of the variable itself: X_{t−1}, X_{t−2}, …, X_{t−p}
PARAMETERS: the parameters of this model are φ1, φ2, …, φp
Focus on the stationary AR(p): -> thanks to stationarity there is a symmetry.
The bottom line is that when I see the ρ̂_k estimated from the data showing some sort of exponential decay in module, I will suspect that this is an AR(p) model.
Identification of an ARIMA(p,d,q) means having some guess on p, d and q, in terms of which are different from 0 and which are equal to 0: some guess on the time series model we are facing.
-> an exponential decay of the module of ρ̂_k can possibly be due to: d = 0; q = 0; p ≠ 0 -> AR(p) = ARIMA(p,0,0)
Ex: I can have two different possibilities:
"BAD NEWS": these patterns are the same for all the AR(p), not just for AR(1); so how can I guess "p", the order of the AR(p)?
The autocorrelation coefficient of order 2 comes out from fitting X̃_t = β1·X̃_{t−2} + ε_t: I am directly considering the relationship between X̃_t and X̃_{t−2} without including the full model with the terms observed at the previous time instants. So when I fit a model in which I consider only the direct dependence on X̃_{t−2}, what I am fitting is exactly the autocorrelation coefficient plus some noise. The partial autocorrelation does not do this: it fits the complete model.
Partial autocorrelation of order 2:
Bottom line: for an AR process (regardless of the order) the autocorrelation function will look like an exponential decay, or equivalently like a damped sinusoid; for an AR(1), however, the partial autocorrelation function (PACF) will show only one significant coefficient, at order k = 1, with all the others zero or close to zero (not significant). In an AR process the autocorrelation function (estimated from the data) shows the exponential decay, and by looking at the exponential decay I can suspect that this is an AR process; the order of the AR process, instead, comes from the sample partial autocorrelation: the number of partial autocorrelation coefficients that are statistically different from zero gives the order of the AR model.
Example: what type of model is it? What is your identification of the AR model? My guess is AR(2), because I see the (negative) exponential decay in the ACF and the first two coefficients of the PACF are significantly different from zero.
ANTICIPATION: MA(q) creates exactly the opposite picture, so we will recognize an MA(q) in exactly the opposite way: the PACF will show an exponential decay in module, while the order of the MA will be clear from the ACF.
4.2.1. VARIANCE OF AR(p)
Example AR(1): -> identification means that I am just guessing the structure; the following steps are to estimate all the coefficients and to check that the residual terms follow the assumptions.
-> Identification thanks to ACF and PACF -> AR(1) -> then I fit the model and estimate all the unknown coefficients -> check the assumptions (e_t ~ N(0, σ²_ε)) (partial autocorrelation coefficient)
Example of non-stationarity -> random walk: it is an I process, not an MA.
4.2.2. AR(2)
If I am able to estimate ρ̂1 and ρ̂2, just by putting these two values into the system of equations I can estimate the two unknown coefficients of the AR(2) process.
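An identification sketch: simulate an AR(2) and look at the sample ACF (exponential-type decay) and PACF (only the first two coefficients significant). statsmodels is assumed available; the coefficients 0.6 and 0.2 are arbitrary.

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf, pacf

# simulate an AR(2): X_t = 0.6 X_{t-1} + 0.2 X_{t-2} + eps_t
ar_poly = np.array([1.0, -0.6, -0.2])     # ArmaProcess takes the lag-polynomial coefficients
ma_poly = np.array([1.0])
x = ArmaProcess(ar_poly, ma_poly).generate_sample(nsample=2000)

n = len(x)
print(np.round(acf(x, nlags=6), 2))       # sample ACF: exponential-type decay in module
print(np.round(pacf(x, nlags=6), 2))      # sample PACF: only the first 2 lags clearly non-zero
print(round(2 / np.sqrt(n), 3))           # rough +/- 2/sqrt(n) significance limit
```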
4.3. MOVING AVERAGE – MA(q)
REGRESSORS: the regressors of this model are the shocks, i.e. the values of the noise at each time: ε_t, ε_{t−1}, …, ε_{t−q}
PARAMETERS: the parameters (weights) of this model that have to be estimated are: 1, θ1, θ2, …, θq
-> the only possible cause of non-stationarity is the AR/I part of the ARIMA, never the MA part.
If I also have the intercept:
4.3.1. VARIANCE OF MA(q)
The ACF allows us to identify the order of the MA(q).
4.3.2. IDENTIFICATION OF MA(q)
The order q can be identified thanks to the SACF. The exponential decay of the absolute value of the SPACF makes us guess that it is an MA process. MA processes are very frequent in chemical and food processes.
4.4. ARMA PROCESS
Let's put it all together. Using B (the backshift operator) we can re-write the model in a simpler form and, assuming stationarity:
STEPS FOR THE CREATION OF THE MODEL:
1. Identification: if I am assuming an ARIMA(p,d,q), to identify means to make a guess on p, d and q. Once we have decided that the variable is autocorrelated and has to be modelled with an ARMA(p,q), we need to decide the values of p and q that characterize the model (how many parameters are in the AR and in the MA part of the model).
2. Estimation of the parameters: once I have identified the structure, fit the model (fit the ARIMA) -> maximum likelihood estimation -> then I can compute the residuals e_t = X_t − X̂_t and the variance of the residuals.
3. Residual check: the e_t should not be autocorrelated anymore; I should check the assumption of iid and possibly the assumption of normality.
-> If the third point fails, we go back to point 1 and try another guess for the ARIMA.
PARSIMONY PRINCIPLE: try to have models where p + d + q is low -> prefer the simpler model. The simpler the better! (avoid overfitting)
4.5. ARIMA(0,1,0) – RANDOM WALK – I PART
The integrated term is included when we have non-stationarity in the process → we have an integrated term when some of the φ in the AR part of the model are equal to 1 (roots on the unit circle). The formulation is the same as an AR(p) model, but with a coefficient equal to one; this imposes non-stationarity: there is basically a trend and the next observation's value keeps drifting.
Once I see a slow, roughly linear decay in the ACF, it usually means non-stationarity.
The way to handle this ARIMA is to avoid working with X_t directly and to work instead with the difference between X_t and X_{t−1}: -> X′_t = X_t − X_{t−1}
IDENTIFICATION OF d: how can I identify d (the order of the non-stationary autoregressive part)? Usually we just apply the nabla (difference) operator repeatedly, until we find stationarity.
Example:
Warning: pay attention to over-differencing (applying the nabla operator too many times); one suggestion is, each time nabla is applied, to also look at the variance of the resulting time series. The right d should correspond to the minimum of these variances.
4.6. ARIMA(p,d,q) – GENERIC CASE
Example: one very common process in continuous manufacturing and chemical processes is the IMA(1,1) = ARIMA(0,1,1).
Example:
Pay attention to avoid a "blind" application of the procedure! Sometimes, when I have a sequence of data where the mean is jumping (mean shift), the ACF looks like that of an AR process, but it has nothing to do with autocorrelation! -> Before looking at the ACF I should recognize that there are two different behaviours in the time series plot: look at the data first, and then start applying the procedure.
Example:
Example:
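A sketch of the d-identification-by-variance rule and of fitting an IMA(1,1) = ARIMA(0,1,1) with statsmodels (assumed available); the simulated series and the θ = 0.4 value are arbitrary:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(9)
eps = rng.normal(size=400)
increments = eps + 0.4 * np.concatenate(([0.0], eps[:-1]))   # MA(1) increments
x = np.cumsum(increments)                                     # -> roughly an IMA(1,1) series

# identification of d: difference until the variance stops decreasing (avoid over-differencing)
for d in range(3):
    xd = np.diff(x, n=d) if d > 0 else x
    print(d, round(np.var(xd), 3))          # the minimum variance should occur at d = 1

fit = ARIMA(x, order=(0, 1, 1)).fit()       # IMA(1,1) = ARIMA(0,1,1)
print(fit.params)                           # estimated MA coefficient and residual variance
resid = fit.resid                           # residual check: should look iid (e.g. run LBQ on resid)
```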
4.7. STEPS FOR DEVELOPING AN ARIMA(p,d,q) MODEL
If I reject the assumptions on the residuals (the residuals are not iid), I have two strategies:
1. I go back and define a different model;
2. I take the residuals and treat them as a new problem: let me try to fit an ARIMA model on the residuals.

5. PRINCIPAL COMPONENT ANALYSIS FOR MULTIVARIATE DATA – PCA – DIMENSION REDUCTION
PCA is one way of modelling multivariate data: I am considering a lot of variables at the same time, a vector of variables. PCA is used a lot in machine learning tools.
Assume I have a couple of variables; we plot a scatterplot in which each point is given by x1 and x2, so for each time instant I am measuring two variables, in this case the age of the population and the spending. In this example I can see some correlation.
The basic idea behind PCA is: can I find a new reference system? So, moving the origin O to a new location O′ -> O′ = [µ1, µ2]′.
The new reference axes are called Z1 and Z2; these new variables are the principal components. I can describe each of them as a linear combination of the original data; so, in general, Z_i is a linear combination of the X_j's (j = 1, 2). I want to find the best new reference system: I choose Z1 along the direction in which we see the maximum correlation between the two variables. I am trying to find the direction along which most of the variability of the data is expressed, so Z1 is the new direction along which the variability of the whole point cloud is best described, while Z2, the remaining principal component, just describes the residual noise term.
Assume I have p variables -> [X1……Xp] (in the simplest case I can imagine having two variables). I am looking for a new reference system, O′, obtained after a translation, and I am looking for the new direction Z1 such that, projecting all the data onto Z1, I keep the maximum variability that I can see in the original data. Z1 is the new direction such that by projecting the X's onto it I represent the maximum variability of the original data. Z1 captures most of the variability of the original data (Z2 just represents noise) -> I can keep only Z1 = f(X1, X2) to represent the information of the original bivariate data X.
Key idea: each time I have X = (X1, X2), I am transforming X into just one number Z1, which is a linear combination of X1 and X2 (Z1 = α1·X1 + α2·X2), and I will discard Z2 because it is just noise. The advantage is that I move from 2 variables to just one variable -> Dimension Reduction. We can move from 1000 variables to just a few variables that represent most of the variability!
WHERE IS THE PCA TECHNIQUE RELEVANT?
1. Process quality data: I have a process where at each time instant I collect a set of quality features (X1…Xp) -> p random variables (e.g. temperature at different locations, humidity, oxygen, printed geometry)
2. Product quality data: I measure the product and not the process (e.g. diameter, temperature, surface, distance between two points)
3. A mix of the previous: at each time instant I collect p random variables that are not supposed to be independent.
X is a p-variate random variable; it is defined by its mean and its covariance -> note: the formula of the correlation [as reported on the slide] is wrong, that one is √ρ!
This type of problem is the other side of the "big data" issue.
Big data:
• p = 256 = pixel intensities in each image.
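A minimal PCA sketch on two correlated variables (age and spending are simulated here), computing the new directions from the eigendecomposition of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(10)
age = rng.normal(40.0, 10.0, size=200)
spending = 3.0 * age + rng.normal(0.0, 8.0, size=200)   # two correlated variables, as in the example
X = np.column_stack([age, spending])

X_centered = X - X.mean(axis=0)            # move the origin O to O' = [mu1, mu2]'
S = np.cov(X_centered, rowvar=False)       # sample covariance matrix
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]           # sort the directions by explained variance
eigval, eigvec = eigval[order], eigvec[:, order]

Z = X_centered @ eigvec                    # scores: the data in the new reference system Z1, Z2
print(np.round(eigval / eigval.sum(), 3))  # Z1 should carry almost all of the variability
```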
Can I use cross-validation, for instance, to select the number of PCs?
Cross-validation: I can use a training set. Assuming that I don't have just 32 images but 100 images, let's play the game of finding the best PCs using just 80 out of 100 images, estimating the reconstruction with 3, 4, … 7 PCs, and then checking how good I am at predicting the new images, i.e. the 20 images that I have left out. Then I will choose the number of components to be retained as the one that minimizes the prediction error, i.e. the one most capable of predicting the future images left out of the overall set of samples. In this way I am not working with the errors computed on the same images used to estimate the PCs, but estimating the error on new images — a sort of prediction error — and cross-validation does exactly this for me.
PCA is good enough for simplifying classification or clustering problems while at the same time reducing the number of dimensions. Dimension reduction necessarily implies information loss: the fewer principal components you keep, and the less variance is explained by the principal components you keep, the more information you may lose.
Remember: before applying the PCA I have to check the assumptions on the starting variables. If one variable, for instance x_1, is autocorrelated, I can find a good model for it and check the assumptions on the residuals; if the residuals are okay, I replace the original variable x_1 with the model residuals and apply the PCA on "residuals_X1; X2; X3 …".
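A rough sketch of choosing the number of PCs by held-out reconstruction error, in the spirit of the 80/20 example above (synthetic data with 3 underlying components; all names are illustrative):

```python
import numpy as np

def pca_basis(X):
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt

def reconstruction_error(X, mu, vt, n_pc):
    v = vt[:n_pc].T                               # keep the first n_pc principal directions
    x_hat = mu + (X - mu) @ v @ v.T               # project and reconstruct
    return float(np.mean((X - x_hat) ** 2))

rng = np.random.default_rng(11)
scores = rng.normal(size=(100, 3))                # 3 "true" underlying components
loadings = rng.normal(size=(3, 20))
X = scores @ loadings + 0.1 * rng.normal(size=(100, 20))   # 100 samples, 20 variables

train, test = X[:80], X[80:]                      # 80/20 split as in the notes' example
mu, vt = pca_basis(train)
for k in range(1, 8):
    print(k, round(reconstruction_error(test, mu, vt, k), 4))   # error stops improving after k = 3
```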
6. STATISTICAL PROCESS MONITORING – SPM
6.1 INTRODUCTION
SPM was originally called statistical process control (SPC), but now "SPC" tends to be avoided, because control is different from monitoring:
• monitoring means just observing the output, looking for anomalies, stability or instability;
• control means using the process parameters to keep the output stable, close to a target.
We are assuming that the process parameters have already been optimized and the operators know how the machines should run; we want to see whether something strange is happening in time. We are observing a process output; the output can be a vector or an image.
I am really interested in the valve in the picture: it is opened at a given time to pick a given sample of parts (sample size = n). The sampling frequency (h) is how often you go there and pick the parts. The sampling strategy is how I select the parts to be measured.
Once we have the data (after the measurement system we have a collection of data), the key point is the control chart, which we assume gives us an alarm if something goes wrong. Then we ask ourselves whether this alarm is a false alarm or not. To answer, we search for an assignable cause that has been creating an anomaly in the process. If I find an anomaly (the alarm is a real alarm and not a false alarm), we can remove the problem and then redesign the control chart. If the alarm is a false alarm, we do not do anything and we continue to collect the data.

6.2. ASSIGNABLE CAUSE VS COMMON CAUSE
(1) Common causes: due to many different sources (material, machines, tools, operators, measurement systems, …). NB: they also exist in business processes.
(2) Assignable causes: due to some specific issues that can be easily identified (batch of material of poor quality, operator error, wrong setup, wear, etc.). Sometimes they even have a positive effect (serendipity).
Common causes are called natural variability; they are always present and there is nothing to be done about them.
What we want to discover thanks to the control chart are the special (assignable) causes of variability, not the natural ones. If we do not find any assignable cause, the violation is just a false alarm and we keep the point. The assumption behind the control chart is that the process behaviour is stable in time.
Example, SAMPLING STRATEGY: I decide to take 5 parts every hour and I will measure the diameter of the cylinders: h = 1 hour, n = 5 parts. We'll see 3 possible sampling strategies:
1. take the last 5 parts produced by the machine after one hour;
2. take 5 parts at random from the parts produced in the last ten minutes, one hour after the previous sample;
3. instead of considering only the last 10 minutes, take random parts among all those produced in the last hour.
Which of the 3 strategies do I prefer? There are a lot of sampling strategies, with different pros and cons. The question I am asking after each hour is: has the distribution of X changed? Is µ different from the in-control µ, or is σ² different from the in-control σ²? -> estimate µ and σ² and decide whether they are different from the in-control µ and σ².
The bottom line is that we should estimate the mean and the variance over a short time period; we cannot spread the estimation of the mean and the variance over a long time period, because in that case, even if something changes within the hour, I would estimate something that does not represent the instantaneous distribution of the diameter I am producing.
-> strategy 3 is the worst case. The drawback of sampling strategy 1 is that the parts can be autocorrelated, so usually we go with the intermediate case, strategy 2.

6.3. TEST OF HYPOTHESIS
• H0: the process is in control
• H1: the process is out of control
This leads to 2 control charts to monitor the stability:
− one for controlling the mean
− one for controlling the standard deviation
V = sample statistic plotted on the control chart; …
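The chart for the mean can be sketched with simple 3-sigma limits; this is an illustrative example with assumed in-control parameters mu0 and sigma0 and a simulated mean shift (not the notes' exact chart design, which also includes the chart for the standard deviation):

```python
import numpy as np

rng = np.random.default_rng(12)
mu0, sigma0 = 10.0, 0.05                 # assumed in-control mean and standard deviation
n, n_samples = 5, 30                     # sample size n = 5 parts, taken every h = 1 hour
data = rng.normal(mu0, sigma0, size=(n_samples, n))
data[20:] += 0.08                        # hypothetical mean shift (assignable cause) after sample 20

xbar = data.mean(axis=1)                 # statistic for the chart on the mean
s = data.std(axis=1, ddof=1)             # statistic for the chart on the standard deviation

ucl = mu0 + 3 * sigma0 / np.sqrt(n)      # 3-sigma limits for the X-bar chart
lcl = mu0 - 3 * sigma0 / np.sqrt(n)
alarms = np.where((xbar > ucl) | (xbar < lcl))[0]
print(round(lcl, 3), round(ucl, 3), alarms)   # out-of-limit samples -> look for an assignable cause
```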