Biomedical Engineering - Methods and Applications of AI in Biomedicine

Completed notes of the course - Applied AI in Medicine

Complete course
CURRENT HEALTHCARE AND BIG DATA IN MEDICINE

What has been happening in recent years is a major demographic transition. If we look at the graph, we see the percentage of the global population on the y axis and the years on the x axis. In red we have children (< 5 y/o) and in blue we have people over 65 y/o. The trend is a decrease in children and an increase in elderly people; in 2010 the two curves intersected, meaning that the number of elderly people is now larger than the number of children. The other graph shows the percentage change in the world population over the period 2010-2050: people over 85 y/o will increase by 351%, people over 65 y/o by 188%, and younger people by only 22%. The population is getting much older.

The related health issue is that the burden of treatment is shifting towards noncommunicable diseases such as heart disease, diabetes and cancer. All this results in a growth in healthcare costs. These costs are not evenly spread across the world (we go from $37 per capita in Pakistan to $12,703 in the US), but everywhere health costs keep increasing. So, there is a rising need for technology that can help reduce these costs and help clinicians obtain better diagnoses. The good news is that there has also been a rise in the health-conscious population: people are more careful about nutrition and fitness and, in general, use electronic devices that can monitor their physiological parameters.

Electronic health records are real-time and patient-centered: they record all the information about a patient, and this information is accessible to the doctors of different hospitals. So, for each patient there is a very large amount of data accumulated over the course of life that can be used by a doctor or a software to obtain a better diagnosis, to better understand the patient's condition and, hopefully, to define the best treatment for each patient.

New models in healthcare
A recent report lists the characteristics a healthcare model should have in terms of quality, access to care and efficiency. One important thing to remember is that through AI we want to help clinicians, not replace them! There are specific tasks an algorithm can do (for example, it can read a biopsy and classify it as a benign or malignant metastasis), but it will never replace a clinician. We want clinicians to learn how to leverage AI to improve their performance.

Value-based treatments
The primary goal of healthcare is to create and maintain a healthier society and improve the lives of its citizens. Until a few years ago, the healthcare system preferred quantity over quality, i.e., it preferred seeing a large number of patients per day rather than seeing fewer patients but improving their treatment and diagnosis. This is now changing towards a value-based model, in which the patient is placed at the center of the scheme and value is defined as the patient health outcomes generated per unit of currency spent. It must be underlined that a good health outcome doesn't mean more visits or more tests: value means better health status. Sometimes the feeling is that doing more visits means taking better care, but that is not the point.

Patient-centered care
The individual is placed at the center. Around the patient there are different specialists, nurses and so on, plus a care management team. Doctors don't work alone: there is coordination. Another important thing is that the patient is involved, sometimes even in the decisions.
Think about oncological patients: sometimes even the doctors don't know which therapy is best, so they talk with the patients and ask what they prefer. Involving the patient in this way leads to higher compliance with the treatment.

Personalized medicine
Nowadays, what usually happens is what we see in the first figure: we have a group of patients, homogeneous in disease (for example colon cancer); a certain therapy is applied, and there are three possible responses. A first group responds very well to the treatment, a middle group stays stable (the therapy is doing neither good nor harm), and for a small group the drug doesn't work and the condition worsens. With future medicine we have the same group of patients with the same colon cancer, but we run some tests before giving them the therapy. These tests could include blood samples, DNA sequencing, or some image processing of MRI or CT. We put together the results of these tests and divide the patients into different phenotypes, for example three subgroups, and for each of them we use a different therapy.

Big data in healthcare
First of all, big data can be defined with this equation: log(n*p) > 7, where n = number of patients and p = number of variables. Big data are becoming very common: research shows that over the last ten years the number of papers about big data has followed an exponential trend. There are various areas related to big data. The most common are research studies (we record the patient's ECG and derive heart rate and other parameters, or we perform MRI and derive other parameters), generic databases (with information on economic status, where you live, where you work, ...), and wearable devices and smartphones.

One problem with collecting data from many sources is that the data will very likely be unstructured and full of errors. Unstructured means that they can be very different (categorical features, numerical features, free text written by different clinicians), so we need to bring them into a standard form before using them as input to our software. There can also be many errors due to the devices that recorded the data or to other causes, so a lot of preprocessing is required. This means we have to look at the data and check that the values are consistent with the measurement they represent (for example, heart rate): if we find a value of 15,000, that is of course an error! We need to develop automatic strategies to inspect our data and find outliers or values that are not physiologically possible.

Data can be:
- Quantitative (numbers): vital signs, diagnosis codes, laboratory results, medications...
- Qualitative: clinical notes, medication order notes, discharge instructions. These might contain more information, but that information is harder to extract.
Putting all of this together, we obtain a large input for our software.

The fourth industrial revolution
We are not in a prolongation of the third industrial revolution; we are in the fourth. There are three main reasons that differentiate the last two: the velocity, the scope and the systems impact. There are two big approaches: the first is traditional machine learning, where you extract some features by hand, select them and then classify them to obtain an output.
The second is deep learning, where we just give the algorithm the image, and it automatically extracts the features and performs the classification.

The origin of AI
The origin of AI is uncertain; it depends on how you define AI. If we think of AI as close to robotics, we can go back about 3000 years to Talos, the ancient mythical automaton who protected Europa (a Phoenician princess) in Crete from pirates and invaders. If we think of AI as close to reasoning, we can go back to the 1940s, when Alan Turing proposed the imitation game, trying to answer the question: can a human tell whether his or her interlocutor is another human or a machine? The definition of AI is also foggy:
- One branch of computer science dealing with the simulation of intelligent behavior in computers
- The capability of a machine to imitate intelligent human behavior

What can AI be useful for? It can help in:
1. Problem solving
2. Knowledge, reasoning and planning
3. Uncertain knowledge and reasoning
4. Learning
5. Communicating, perceiving and acting

What can AI do in medicine?
- Support of the main decisional tasks (diagnosis, therapy, prognosis, planning)
- Knowledge and data representation
- Robotic tasks
Example: a decision support system based on an AI algorithm suggests to the doctor a treatment plan based on ACE inhibitors against high blood pressure. Why did the software give this recommendation? What was later found is that there was an unexpected reaction to the drug, more pronounced in African-American people. Worse, the clinical trials used to validate the algorithm had been performed only on Caucasian subjects, so the population on which the algorithm was validated was biased.

BASIC GLOSSARY

How is machine learning defined? There are two definitions. Let's see two examples: the first is not related to biomedical aspects, the second is. In the second case the experience is given by the data from the patients: some patients with specific characteristics are more likely to have diabetes.

Data
Data are the main ingredient of machine learning and deep learning, and they are especially useful in biomedical fields; it is important to have a large database available. Data are usually collected in a data matrix. In our case we will have different samples, which represent patients or images (for example MRIs); the simplest example is that each sample is a patient. For each patient we collect different features, which can be of very different kinds (age, gender, drugs...). A very important feature is the so-called class, i.e., the label used as the target. The labelled data form a big matrix containing all the features plus the class. The features can come from different sources and therefore have different forms: categorical variables (gender), numerical variables (heart rate) and so on. All the features from all the patients represent the experience, because the software tries to learn from these data the logic behind, for example, the risk of diabetes.

Supervised learning
Supervised learning uses labelled data, so in the training we know both the features and the label. In the training phase the algorithm tries to learn a function able to map the inputs to our target, the output: the relation between input and output is what the software tries to learn. Note that we don't want a function that maps only those specific patients to their outputs: we also want to generalize, so that if we present a new patient to the software, it is able to predict the output.
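A minimal Python sketch (with made-up names and values) of the labelled data matrix just described: one row per patient, the features as columns, plus the class column used as the target.

```python
import pandas as pd

# Hypothetical labelled data matrix: one row per patient, one column per feature,
# plus the "class" column used as the supervised-learning target.
data = pd.DataFrame({
    "age":        [54, 61, 47, 70],
    "gender":     ["F", "M", "M", "F"],        # categorical feature
    "heart_rate": [72, 88, 65, 95],            # numerical feature
    "on_insulin": [False, True, False, True],  # binary feature
    "class":      [0, 1, 0, 1],                # label: diabetes no/yes
})

X = data.drop(columns="class")   # features = the "experience" the model learns from
y = data["class"]                # target the model must learn to predict
print(X.shape, y.value_counts().to_dict())
```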
In supervised learning there are two main tasks, namely regression and classification. In this example we want to predict the blood pressure of the patient on the following day. Regression and classification depend on the nature of the output: if the output is a continuous numerical value, we have regression; in this example we predict the exact value of the blood pressure (125 mmHg). In a classification problem, instead, the output is a discrete value, usually binary but possibly multiclass; in this case it is binary, low (normal) or high pressure. Same patient, same problem, but two different approaches.

We will have a set of m training examples {(x_i, y_i), i = 1, ..., m}, where the x_i are our feature vectors. We then look for a target function f such that y = f(x), i.e., a function able to map our input to our output: we want to find the relation between x and y. If we are able to find this function f, we will be able to classify a certain number of never-seen data, the data of the test set: after training the algorithm we obtain a function relating input and output, and we apply that function to our test data. We don't want to find the exact function that maps only those patients' data to their outputs, because we also want to generalize: if we present a new patient to the software, it must be able to predict the output.

Supervised learning pipeline
There are 3 key tasks.
1. Prepare to build a model: this phase is made of three sub-blocks: define the task, collect the data and prepare the data. Defining the task means specifying the input and output of the model. Examples of input: clinical variables introduced by the doctors (age, blood pressure...). Examples of output: diabetes yes or no, dead or alive... Collecting the data means recording them; in many cases we will not be very involved in the recording phase, because we are not doctors. Finally, we prepare the data: we build our labelled matrix and decide how to divide the data into training and test sets.
2. Train the model: the model starts with some initial parameters and produces a prediction for each input. This prediction is compared to the label, which we know, and from this comparison the algorithm learns how the parameters should change to make the prediction more similar to the label. The algorithm iterates through the examples of the training data to understand which relation is best.
3. Evaluate the model: from step 2 we obtain the best model for our task. We then run the test set, i.e., data never seen by the model, through this final model; the model produces predictions that we compare with the true labels. This is the final prediction, the real behavior of the model. From this comparison we compute some performance metrics, like accuracy, specificity, sensitivity, etc.

Looking more closely at model preparation, it is divided into two sub-blocks: data collection and preparation, and dimensionality reduction.

1.a Data collection and preparation: this first step is very important, because poor data will give a bad classification; there is no way to get a good classification if the data are poor. There is a list of characteristics that our data should have to be good:
- Accurate and valid: data recorded within acceptable parameters; for example, the codes used in the hospital are in line with the standards.
- Reliable and consistent: for example, the age of the patient recorded on each exam is the same in all hospital departments.
- Complete: we have all the variables that we are supposed to have.
- Current and well-timed: they should be up to date; for example, we must be sure that the list of drugs given to the patient is updated.
- Accessible: available to authorized people like doctors, nurses, etc.
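Going back to the regression vs. classification distinction at the start of this subsection, here is a minimal sketch of how the same (simulated) patient features can feed either task; the feature values and the 130 mmHg cut-off are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # hypothetical standardized features (age, BMI, heart rate)
bp = 120 + 8 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=4, size=100)  # continuous blood pressure (mmHg)
high_bp = (bp > 130).astype(int)                   # discrete label: high pressure yes/no

# Regression: predict the exact value (e.g. 125 mmHg).
reg = LinearRegression().fit(X, bp)
print("predicted mmHg:", reg.predict(X[:1]).round(1))

# Classification: predict the discrete class (0 = normal, 1 = high).
clf = LogisticRegression().fit(X, high_bp)
print("predicted class:", clf.predict(X[:1]))
```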
The data collection and preparation step is further divided into three sub-steps.

I. Exploratory data analysis: we do this by visualizing our data and computing some statistics to understand how the data behave. Data visualization consists in a graphical representation of the data in order to understand how the features are spread. This helps us get an idea of how the data are distributed and whether there are relations between them; we can not only understand how the dataset is composed but also spot outliers, missing data or particular patterns. Popular visualization tools:
- Line plot
- Bar chart
- Histogram
- Boxplot
- Scatter plot
Example: let's consider two datasets that have exactly the same mean (100) and the same standard deviation (20), but very different distributions. We realize this by looking at the first plot: the green one has a normal distribution, while the purple one has a logarithmic distribution. If, on the other hand, we only plot the bars, we don't see any difference. In the middle we have a boxplot: looking closely, we realize the distributions are different, since the green one is divided equally by the median, while the purple one has the median off-center, so one part is bigger than the other.
Another example: these three plots represent the same data with different bins. In the first case we have very wide bins, in the second very narrow bins, and in the third unequal binning. The ideal representation shows two main peaks, and this is an important piece of information to derive. The suggestion for figuring out whether we are getting the right representation is to plot the data with different bins and different kinds of graph, and then look for the one that works best for our problem. We should also try to summarize the distributions and the relationships between variables using statistical quantities.

II. Handling abnormalities: abnormalities can be missing values, outliers and incorrect entries. Suppose we write a protocol with a doctor and decide to record the ECG of the patients together with a few clinical characteristics (gender, age, blood pressure). A missing value arises, for instance, when the nurse forgets to ask the age of the patient, so the data matrix we saw at the beginning has some holes. Outliers are data that are very far from the average behavior of the group of patients; for example, an athlete with a heart rate of 50 bpm because he is very well trained, while the others have 80 bpm, could be an outlier. Incorrect entries are usually due to the fact that much of the data is recorded or written manually; for example, instead of writing 80 bpm, someone wrote 800.
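A minimal sketch (not from the notes) of the exploratory checks just described: plotting the same simulated heart-rate values with different numbers of bins, and automatically flagging values outside a plausible physiological range. The 30-220 bpm limits and the injected 0 and 15,000 bpm entry errors are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
heart_rate = np.concatenate([rng.normal(80, 10, 500), [15_000, 0]])  # two deliberate entry errors

# Same data, three different binnings: the apparent shape changes with the bin width.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 30, 100]):
    ax.hist(heart_rate[(heart_rate > 0) & (heart_rate < 300)], bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()

# Automatic sanity check: flag values outside a plausible physiological range.
outliers = (heart_rate < 30) | (heart_rate > 220)
print("suspicious entries:", heart_rate[outliers])
```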
How can we handle missing data? Let's consider our data matrix: we have N patients (samples) and M features plus the class, and for the second patient the second feature is not recorded (NaN). One possibility is to ignore this feature; the problem is that we don't know in advance whether this feature is relevant for our problem. Another approach is to remove the patient with the missing value; the problem is that we are throwing away data, which in general are very precious, and the data matrix becomes smaller. Another approach is imputation: we replace the missing data with an estimated value, so we remove neither features nor patients. However, we might introduce some bias, which can become a problem if the estimated values are many. There are different ways to estimate these values; the most basic is to derive the value from the other observed values, for example by computing their mean.

How can we handle outliers, and how can we identify them? From the representation of the data we may see that the distribution is normal, except for some patients with a value of 0: these are outliers because they are distant from the rest. Another approach is to build a model, for example a regression, for the observations: the crosses represent the values, we fit a regression line, and the cross in the corner, far from the red line, is an outlier. To handle outliers we need to decide whether these values are correct or not, because we are often not sure that they are errors: yes, these data are far from the others, but are we sure they are wrong and the result of errors? Even if they are real data, an outlier usually influences our model, so even real and correct outliers have to be handled. We can simply delete those values, following the same steps we have seen for missing data.

III. Rescaling the data: this can be performed either by standardization or by normalization. We need to rescale the data because they can come from very different sources and have different orders of magnitude. Standardization is usually performed with the so-called z-score normalization: each feature is rescaled as z = (x − μ)/σ, so that the rescaled feature has mean μ = 0 and standard deviation σ = 1. Normalization consists in rescaling all the features into the range between 0 and 1, using x_norm = (x − x_min)/(x_max − x_min); in this way we obtain a fixed range of values for the normalized features. If we want to apply normalization, we have to be sure that outliers are not present, because otherwise the outlier takes the extreme value and almost all the other values collapse towards zero. Standardization, on the other hand, does not give us bounded values.

1.b Dimensionality reduction: there are two types of dimensionality reduction, feature selection and feature projection.
Feature selection consists in selecting a subgroup of features that are most useful for building a classifier that predicts our class well. There are different selection methods: filter methods, wrapper methods, embedded feature selection, regularization, LASSO.
- Filter methods: they select features from the data matrix prior to the learning step, i.e., during data preparation we decide to remove some features for some reason. They are based on the characteristics of the features themselves. One of the most used filter methods is correlation-based feature selection. The assumption is that many features in the data matrix will be correlated; if two features are highly correlated, we can decide to use only one of the two. So, we compute the correlation matrix and look for pairs of variables whose correlation coefficient is above a threshold; if two features are highly correlated, we keep just one of them, the one with the lower average correlation with all the other features. In this example we use a threshold of 0.80: the correlation between features 1 and 3 is higher than 0.8 (0.86); to decide which one to remove, we consider all the other correlations of each of the two and compute the average, and we decide to remove feature 3. This correlation matrix should be computed on the training set and then applied to the test and validation sets.
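A minimal sketch of the two rescaling formulas and of the correlation-based filter with the 0.80 threshold described above; the feature values are simulated and scikit-learn is assumed to be available.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
X["f3"] = X["f1"] * 0.9 + rng.normal(scale=0.3, size=200)   # make f1 and f3 highly correlated

# Rescaling: z = (x - mu) / sigma, or x' = (x - min) / (max - min).
X_std = StandardScaler().fit_transform(X)     # mean 0, std 1 per feature (unbounded)
X_mm = MinMaxScaler().fit_transform(X)        # bounded to [0, 1]
print("standardized means:", X_std.mean(axis=0).round(2), "stds:", X_std.std(axis=0).round(2))

# Correlation-based filter: for each pair with |r| > 0.80, drop the feature whose
# average correlation with all the other features is higher.
corr = X.corr().abs()
to_drop = set()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > 0.80 and not {a, b} & to_drop:
            mean_a = corr[a].drop(a).mean()
            mean_b = corr[b].drop(b).mean()
            to_drop.add(a if mean_a > mean_b else b)

X_reduced = X.drop(columns=list(to_drop))
print("dropped:", to_drop, "-> reduced shape:", X_reduced.shape)
```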
Another important technique is significance-based feature selection (a sketch is given at the end of this subsection). It evaluates the relationship between each feature and our target. Let's consider 5 features and a binary classification task: patients with or without diabetes. We take feature 1, assume it is normally distributed, and perform a t-test between patients with diabetes and patients without diabetes; the resulting p-value tells us whether feature 1 is able to discriminate people with and without diabetes. We do this for all the features and decide to keep only those that are related to the output, i.e., that significantly differentiate the two classes, removing the features whose p-value is above the threshold, here fixed at 0.05. Note, however, that two features that are not significant on their own, such as features 2 and 4, may become very good at discriminating the two classes when put together.
- Wrapper methods use the learning technique itself to evaluate which features are more relevant: they rely on a machine learning algorithm and a performance metric. The first method is called forward stepwise selection. In the figure we have, on the right, all the features considered. We start the model with no variables, which is why it is called the null model. Then we add one feature: with 5 features, we run our classifier with each single feature, evaluate its performance (for example with accuracy), and find that the blue feature gives the best accuracy, so the one-variable model uses the blue feature. Then we add another feature: we try all the models with two features, always keeping the blue one chosen before, and find that the blue-purple combination is the best in terms of accuracy. We continue until a stopping rule is satisfied, for example when there is no further increase in accuracy, when all the features have been used, or something similar. A similar approach is backward stepwise selection, which is the reverse: we start from the full model with all the variables and remove one variable at a time, removing the one whose elimination gives the best accuracy, until the stopping criterion is satisfied. There is a method in between these two, called floating forward stepwise selection. Consider the model with three variables, blue, purple and violet: we also try to remove one feature, because we have never tried, for example, the purple-violet pair, since we started from the blue one. It is true that the best one-variable model uses blue, but there might be a pair without blue that is better. So we try to move backward as well as forward, and we remove a feature only if the resulting model performs better in terms of accuracy; otherwise we continue adding features. These wrapper methods are computationally quite expensive, so they are suitable only when we have few variables.

Feature projection: the most common method is principal component analysis (PCA). Feature projection means that we keep all the features, but we combine them so that the number of columns in the matrix is reduced while the information of all the features is maintained. PCA maps the original feature space into a new space, called the component space, and the features in the new space are linear combinations of the original features. The first component explains the majority of the variance, and so on.
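A minimal sketch of the significance-based filter mentioned above (one t-test per feature, p < 0.05 threshold); the data are simulated and only feature 0 is made discriminative on purpose.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n, p = 120, 5
y = rng.integers(0, 2, size=n)                 # 0 = no diabetes, 1 = diabetes
X = rng.normal(size=(n, p))
X[y == 1, 0] += 1.5                            # make feature 0 discriminative on purpose

keep = []
for j in range(p):
    _, p_value = ttest_ind(X[y == 1, j], X[y == 0, j])
    if p_value < 0.05:                         # significance threshold from the lecture
        keep.append(j)
print("features kept:", keep)
```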
2. Model training: we divide all our data into two groups, training and testing. In the training phase we have a set of m training examples, i.e., m patients with different features; x is a vector representing a patient and containing up to hundreds of features. For each of these patients we also have an output, a target y; in most cases these are categorical, binary variables. What we want to learn during the training phase is the function f that maps the training examples to their targets y. It is an iterative phase. We can find the function f with different approaches, and with this function we are able to classify an unseen group of examples: we use f to map the new examples of the test set to their targets.

It is very important that the two sets are truly independent. Suppose we have m patients and each patient has more than one radiological image, for example two MRI images from two different modalities. Each vector x then represents not a patient but a single image, so we need to pay close attention that all the images from one patient go into the same set: patient 1, having two images, must have both of them in the training set; patient 2, having two other images, must have both in the test set; never one in the test and the other in the training. If this is not respected, it is a kind of cheating. Consider now a patient with 5 ECGs taken on 5 different days: theoretically they are very similar. If we put 4 of them in the training set and 1 in the test set, the function f learned during training will predict that particular ECG very well, simply because the model has already seen an almost identical example during training. So, remember to divide patients, not individual samples, between the two sets: a grouped, patient-wise split (sometimes loosely called a stratified split).

This is another way of seeing the same thing: we have our training set, we use the model to obtain predictions, we compare the predictions with the outputs we know, and we iterate until some stopping criterion is met. When training is finished, we have a final model that we apply to the test set to get the final predictions, and we evaluate the model based on these predictions. Another important point is that the training set is usually further divided into training and validation: if we have to determine some hyperparameters, we can decide which values are better based on the results on the validation group.
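As noted above, all the samples from one patient must end up in the same set. A minimal sketch of such a grouped split using scikit-learn's GroupShuffleSplit; the patient IDs, features and labels are made up.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Ten images belonging to five patients (two images each).
patient_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
X = np.arange(10).reshape(-1, 1)               # placeholder image features
y = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 0])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

# No patient appears in both sets.
assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])
print("train patients:", sorted(set(patient_id[train_idx])))
print("test patients: ", sorted(set(patient_id[test_idx])))
```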
3. Model evaluation: especially in biomedical fields, we may have hundreds of patients on which we train and select the best model, and then an independent cohort of patients with very similar characteristics on which we want to test it. This can be obtained in different ways. For example, suppose we want to predict the development of atrial fibrillation, a supraventricular arrhythmia, during the hospitalization of patients after cardiac surgery. There are data that the hospital has already collected before the operation, i.e., retrospective data such as ECG recordings and clinical variables, and we develop the model based on these. From then on, all the operated patients go into what is called the prospective group of patients. You then have to make sure that all these data are collected in the same way, i.e., that they are aligned to the same protocols. You can have, for example, a situation where some of the data come from one hospital and go into the training set, while data from another hospital go into the test set.

The very first way to divide the patients is called the split-sample procedure (hold-out validation), where we simply divide the group of patients into a training and a test set. The training set is usually taken to be larger, for example 2/3 of the data. How should we divide the two groups? Consider an output that is a binary variable, in our example development of atrial fibrillation yes or no: we must try to have a balance between training and test set. For example, if 20% of the patients in the training set have the pathology, we should have a similar percentage in the test set. We must also try not to put all the patients with the same characteristics together in the training set: if we put only men in the training set and only women in the test set, the model might have problems. With a single split we have to be lucky and hope that, when we run the test, the results are comparable; otherwise, we repeat the process several times and take an average.

To test our model we can also use k-fold cross-validation: we take all the data as before and divide them into k folds, for example 200 patients into 20 groups of 10. We use k−1 folds for training, and the best model chosen is used to predict the data in the remaining fold. This is repeated k times: in the first iteration we train on the first k−1 folds and test on the last one; in the second iteration we train on folds 2 to k and test on the first, and so on. We repeat this procedure k times and average the results. This is very good for research purposes, because we obtain a reliable estimate of the prediction performance; but if we want to give the clinician a single model capable of predicting whether a patient is likely to develop the disease, we cannot use k-fold cross-validation directly, because we have made predictions with multiple models. There are different ways to obtain a single model: average the models, or take all the k folds (i.e., all the data) and retrain the final model on them, using the best parameters found previously with the k-fold cross-validation; otherwise, a simpler approach is to select the best fold and use only that model.

Another approach is called bootstrapping. In bootstrapping we have all the samples with all the data, and we draw n values from the original set of data with replacement: we randomly choose n samples, so some of them may be repeated and some will be excluded. In this example, samples 1, 3 and N−1 are not taken, while sample 2 is taken 3 times and sample N twice. The samples that are not included in the training set constitute the test set. This is repeated several times and the test results are averaged.
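A minimal sketch of stratified k-fold cross-validation and of a single bootstrap split (with the out-of-bag samples used as test set); the data and the logistic-regression model are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# Stratified k-fold: each fold keeps roughly the same proportion of positives.
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print("k-fold accuracy:", np.mean(scores).round(3))

# Bootstrap: sample n indices with replacement; the left-out (out-of-bag) samples form the test set.
boot_idx = resample(np.arange(len(y)), replace=True, random_state=0)
oob_idx = np.setdiff1d(np.arange(len(y)), boot_idx)
model = LogisticRegression().fit(X[boot_idx], y[boot_idx])
print("out-of-bag accuracy:", round(model.score(X[oob_idx], y[oob_idx]), 3))
```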
Now, let's see what we mean by performance metrics. Consider a binary classification problem. The very basic thing to do is to build the confusion matrix: we have the true class and the class predicted by the model, which gives four different possibilities. What we want are true positives and true negatives, meaning that all patients who really develop AF are predicted to develop AF, and all patients who don't have AF are predicted not to have AF. A false negative is a patient who in reality develops AF but for whom our model predicts no AF; a false positive is a patient who does not have AF but for whom the model predicts AF. From this confusion matrix we can compute many metrics to describe our performance. Important note: it is essential to produce the confusion matrix for the test set. If already in the training set false negatives and false positives are around 50%, we understand that the model is not appropriate; but usually the predictions on the training set are very good, and what matters is that they remain good on the test set. Of course, we can also build a confusion matrix for the validation set, for example when we have to choose some hyperparameters: we compute the accuracy on the validation set and the hyperparameters giving the best accuracy are chosen as our hyperparameter set.

There is a list of the most common performance metrics. If we have unbalanced data, for example 50 patients of whom only 10 develop AF, and our model predicts that no patient develops AF, its accuracy will be 80%: overall the accuracy is very high, so with unbalanced data we cannot use accuracy as the metric; in this case sensitivity and positive predictive value would be 0. Moreover, it is important that during training the data are balanced.

Another way to measure performance is the area under the ROC curve. Usually a classification model does not output 1 or 0 directly, but a probability of belonging to a certain class, so we need to choose a threshold; with the same model, different thresholds give different performances. The ROC plot shows the true positive rate versus the false positive rate, i.e., the sensitivity against (1 − specificity). The perfect classifier sits in the upper left corner: we want a true positive rate equal to 1, so that all positive samples are recognized, and no false positives. In the plot, each line represents a different model; if we just flip a coin, we obtain a random classifier lying on the diagonal. Of course, the closer we move towards the perfect classifier, the better the model.
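A minimal sketch of the confusion matrix, the derived metrics and the ROC AUC, computed on hypothetical test-set predictions (all values made up).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])   # 1 = develops AF
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0, 1, 0])   # hard predictions from a hypothetical model
y_prob = np.array([.9, .2, .1, .4, .3, .8, .6, .2, .7, .1])  # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
accuracy = (tp + tn) / len(y_true)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} accuracy={accuracy:.2f}")

# The ROC AUC uses the probabilities, so it does not depend on one particular threshold.
print("AUC =", round(roc_auc_score(y_true, y_prob), 2))
```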
MISSING DATA AND OUTLIERS

The ideal case: we have a certain number of subjects (rows) and variables (columns) and the data matrix is complete. The real scenario is a matrix with grey dots or empty spaces, which represent outliers and missing data. Missing data are observations that we wanted to record but could not, for many possible reasons: the patient deliberately chooses not to answer some questions (for example, some people do not answer questions about income), we cannot obtain the blood pressure because the patient missed the visit, or people come to the hospital but some recording machines are broken. We want to understand the main reason for this lack of information. There are different missingness mechanisms underlying the missing values, and they are divided into three classes:
- Missing completely at random (MCAR): the patients with missing values are a random subset of the studied population. For example, we have patients who underwent biopsy, and the glass slides of a certain group of patients were broken, so the pathologists could not analyze those biopsies. The missing data depend neither on the observed data nor on the unobserved data. For researchers this is the best situation, because there are no statistically significant differences between these patients and the others.
- Missing at random (MAR): the missing values depend on some other information that we have already observed. For example, it has been shown that in depression studies females are more likely to answer all the questions of the survey, while males are less willing to respond to the questionnaires, so for many males we miss some information. This means that if we are somehow able to account for sex, we can also account for the fact that some data are missing. In another study on the weight of a group of healthy people, it was found that weight was less likely to be recorded in young subjects, because on average young people go to the doctor less often than old people.
- Missing not at random (MNAR): the missing values depend on the unobserved data themselves. For example, in a study on depression, patients with severe depression are less likely to complete the depression survey itself: if we divide people into depression levels, of which the highest is severe, and all the patients with severe depression skip the questionnaire, our data will be very biased, because we miss all the information from the severe-depression subgroup. Another example: we want to study the effect of a drug, and some patients do not come to the follow-up visit because of the adverse effects of the drug itself. Or, about income: if one of the questions of a questionnaire concerns income and people with high salaries do not respond, the data we are missing are related to the value itself.

Missing completely at random (MCAR), example: a subject missed a couple of visits because of a work-related emergency, so the value for that time point (for example the blood pressure) is not recorded.
Missing at random (MAR), example: male participants are more likely to refuse to fill out a depression survey, but this is not related to their depression levels.
Missing not at random (MNAR), example: a patient missed a check-up because the drug made her/him sick (the data on the drug level are missing because of the drug level itself).

Missingness mechanisms
The first two mechanisms, and especially the first one, are called ignorable mechanisms, meaning that the missingness of these data does not depend on the values we are measuring, so what we observe is not biased. This is certainly true for MCAR, while for MAR we might need to take something into account, but we can somehow manage to solve the problem. MNAR is a problem: when we apply our classification algorithm, we need to model the mechanism of missingness itself, so that the model can take this information into account and learn something from it.
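The three mechanisms can be illustrated with a small simulation on two correlated variables (all values made up): under MCAR the missing and observed groups look alike, while under MAR and MNAR the missing group is shifted.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.5, size=n)     # x and y are linearly correlated

# MCAR: every observation has the same probability of being missing.
mcar = rng.random(n) < 0.2

# MAR: the probability that y is missing grows with the *observed* x.
mar = rng.random(n) < 1 / (1 + np.exp(-(x - 1)))

# MNAR: the probability that y is missing grows with the *unobserved* y itself.
mnar = rng.random(n) < 1 / (1 + np.exp(-(y - 1)))

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: mean y of missing = {y[mask].mean():+.2f}, "
          f"mean y of observed = {y[~mask].mean():+.2f}")
# Under MCAR the two means are close; under MAR and MNAR the missing group is shifted upwards.
```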
Missingness mechanisms
Let's consider a cloud of points with only two features, where x is the observed independent variable and y is the dependent one. The black cloud represents the complete data, while in red we have the missing data. What we observe is that, when data are missing completely at random, the two clouds are similar and no bias is introduced: the red points are simply fewer, but the two clouds have basically the same distribution. The missing values will of course reduce the sample size, but they do not lead to any bias. In the second case, missing at random, the missing values are related to the x feature: subjects with higher values of x are more likely to be missing, so the cloud of missing points is shifted to the right, because the higher x is, the higher the probability that the data point is missing. We can also observe that the red cloud is shifted upwards, but this is only due to the linear correlation obviously present between x and y: if x increases, y increases as well, so if values with high x are missing, values with high y are missing too. All of this is different from missing not at random, where the cloud of missing values is shifted much more strongly towards high values of y: it is the subjects with high values of y themselves that are likely to be missing. Similar considerations can be made for x.

Methods to minimize missing data in the design phase
Here are some methods to minimize missing data.
- Standardized rules to optimize data collection: we may want to train the staff that collects the data, coordinate data collection as much as possible to make it easier for nurses and doctors, define exactly which data we want to collect, define the expected ranges, and so on.
- Pilot studies: if we plan to collect data from, say, 1000 patients, we can start with 50; we would immediately realize which data are the most difficult to obtain, and we could change the protocol of our study accordingly. This is very study-dependent.
- Regular monitoring of data quality and completeness: imagine again collecting data from 1000 patients without being able to run a pilot study; we should look at all the data during the first analyses and check whether anything is wrong.

Methods of dealing with missing data
What can we do to perform our analysis and get the best results? Here is the list of possible methods for dealing with missing data.

1. Complete-case analysis
We analyze only the cases for which we have all the data for all the features. In this table, patients number 4 and 9 are missing one of the 3 features, so we exclude these patients, i.e., the whole corresponding rows.
Advantages: simplicity and comparability across analyses.
Disadvantages:
- Reduces statistical power (due to the reduced sample size)
- Does not use all the information, because by removing the whole row we also discard values that are not missing
- Estimates may be biased if the data are not MCAR

2. Missing indicator method
We do not exclude patients with missing values; instead, we set the missing values to a fixed value (for example 0), even if it is unrealistic (for example, weight equal to 0), and we add an extra dummy variable, for example called 'missing', equal to 1 for those patients. So, all the patients with weight 0 also have 1 in the dummy variable.
Advantage: no observations are excluded.
Disadvantage: subject to bias. Consider this example: on the x axis we have the BMI, divided into 4 groups, and on the y axis some outcome; we immediately see a linear correlation between BMI and outcome. What happens if 35% of the values are missing? We simply assign the patients whose BMI is unknown to a fifth category.
In this way, from the second group onwards we still see a linear correlation, but the first group is biased, and it becomes difficult to see a straight correlation between outcome and groups.

3. Single value imputation
Here we replace our missing values with some other reasonable value. How do we choose this value? The simplest approach is to use the mean value of all the other patients. This may of course bias our distribution: in the figure on the left we have the BMI as a continuous variable, with a histogram showing the number of patients with different BMI values, observed on 1000 subjects; the distribution is Gaussian. Now consider the same data with 35% missing values: we remove 350 values and replace them with the average of all the other BMI values, and the histogram now has a very high peak around the mean value.
Another way to replace missing values is the last observation carried forward (LOCF), used in longitudinal studies, where patients are followed for many months and something is recorded at each visit, so that we have values for all the follow-up visits. If, for example, a patient showed up until the sixth visit and then stopped coming, we can assume that the value of our score stays the same as at the last visit.
Another way is regression-based single imputation. The blue cloud contains the points that we fully know (both x and y); we fit a linear regression between x and y (the black line). If some values of y are missing but we know the corresponding x, we simply assume that y lies on the regression line. In this case we overestimate the correlation between the variables and underestimate the variance.
Advantage: it uses information from the observed data, giving better results than the previous methods.
Disadvantages: over-estimation of the model fit and of the correlation estimates; underestimated variance.
Another option is to replace the missing values with a random sample from a reasonable distribution. First we need to identify the distribution of the data and estimate its parameters; once the distribution is identified, each missing value is replaced by a random draw from it. Example: if gender is missing and we know that the percentage of males and females in our project is the same, we can say that the distribution is a Bernoulli distribution with p = 0.5 and simply draw random values from it.
Advantage: the data distribution is preserved.
Disadvantages: the distributional assumption may not be reliable (or correct), and the representativeness of the generated data is doubtful even when the assumption is correct.
With all these methods we do not account for the uncertainty of the missing data, which means that the standard errors are usually underestimated.
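A minimal sketch of the single-value strategies above (mean imputation, LOCF and regression-based imputation) on a made-up longitudinal BMI series; the variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "visit": [1, 2, 3, 4, 5, 6, 7, 8],
    "bmi":   [24.0, 24.5, np.nan, 25.0, np.nan, 25.5, np.nan, np.nan],
    "age":   [50, 50, 50, 51, 51, 51, 52, 52],
})

# 1. Mean imputation: replace NaN with the mean of the observed values.
mean_imputed = df["bmi"].fillna(df["bmi"].mean())

# 2. Last observation carried forward (LOCF), typical for longitudinal studies.
locf_imputed = df["bmi"].ffill()

# 3. Regression-based imputation: predict the missing BMI from another observed variable (age).
observed = df.dropna(subset=["bmi"])
reg = LinearRegression().fit(observed[["age"]], observed["bmi"])
reg_imputed = df["bmi"].copy()
missing = df["bmi"].isna()
reg_imputed[missing] = reg.predict(df.loc[missing, ["age"]])

print(pd.DataFrame({"mean": mean_imputed, "LOCF": locf_imputed, "regression": reg_imputed}))
```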
4. Sensitivity analyses with worst-case and best-case scenarios
Another approach is to replace the missing values with the worst or the best value in the observed data. For example, consider a study collecting data from subjects who want to quit smoking: if we give them a questionnaire asking how many cigarettes they smoked last week and they don't answer, one approach is to use as replacement one of the worst observed replies, for example 200. The hypothesis behind this is that if a patient does not answer this particular kind of question, it is very likely that he smokes a lot.

5. Multiple imputation
The aim is to provide unbiased and valid estimates based on as much information as possible, i.e., all the information we have from the observed data. Multiple imputation is based on 3 steps:
- Generate different datasets, in which the missing values are imputed slightly differently. If, for example, we replace missing values by random draws from the observed distribution, we can repeat this step several times and generate multiple datasets.
- Perform the analysis on each of the datasets.
- Combine the results into a single set of parameter estimates, standard errors and test statistics.
In this way we gain something very important: we introduce some random error, a kind of probabilistic variability, into our data. Of course, we cannot use deterministic imputations for this: if we use the mean, it always stays the same; but if we draw, say, 20 values from a normal distribution and repeat this 10 times, we will never get exactly the same values, we will get 10 different sets. With these differences we can run the algorithm, look at the final results and see whether our prediction is robust to the different imputed values.

Expectation-Maximization method
Imagine the model we are building. If we knew the missing values, the model parameters could be estimated straightforwardly; conversely, if we knew the model parameters, we could obtain predictions for the missing values. The idea is: imagine we know the age only for some subjects and we build the model only on those subjects; then we estimate the missing ages from the model; then we train the model again using all the ages, so that we have estimated the parameters of the model. We iterate these two steps, the estimation of the unknown features and the re-estimation of the model parameters using those features as input, until convergence.

Working with missing data in a nutshell - summary
Steps 1 and 2 are easy enough. For step 3 there are no guidelines, so the suggestion is to try different methods and check how the results look, whether some bias is introduced, and so on.
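A minimal sketch of the Expectation-Maximization idea just described: alternately re-estimating the missing ages from the current model and refitting the model on the completed data until the estimates stop changing. The variables and the simple linear model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 300
age = rng.normal(60, 10, n)
weight = 0.5 * age + rng.normal(scale=5, size=n) + 40
missing = rng.random(n) < 0.3            # 30% of the ages are missing
age_obs = age.copy()
age_obs[missing] = np.nan

# EM-style iteration: (1) estimate the missing ages from the current model,
# (2) refit the model on the completed data, and repeat until the estimates stabilize.
age_hat = np.where(missing, np.nanmean(age_obs), age_obs)   # crude initial guess: the mean
for _ in range(20):
    model = LinearRegression().fit(weight.reshape(-1, 1), age_hat)
    new_estimates = model.predict(weight[missing].reshape(-1, 1))
    if np.max(np.abs(new_estimates - age_hat[missing])) < 1e-6:
        break
    age_hat[missing] = new_estimates

print("mean absolute error of imputed ages:",
      np.mean(np.abs(age_hat[missing] - age[missing])).round(2))
```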
EXAMPLE OF MISSING DATA
We are dealing with lung cancer, the most common cause of death from cancer worldwide (25% of cancer deaths). There are some open clinical problems, because many patients with a specific type of lung cancer, non-small cell lung cancer (NSCLC), are not treated correctly because of differences in patient characteristics; these differences can come from some bias in the population or in the research. One of the things we can do is to help clinicians with a decision support system, i.e., support medical decisions based on clinical information taken from electronic records and from the different follow-up visits. Of course, a decision support system is usually based on some AI tool, for example classifiers or regression models, which need to be built and trained on a very large dataset to be reliable and robust enough: the bigger the dataset, the better the results. Unfortunately, clinical data tend to be noisy and to have many missing values.
- We can filter out the missing data, but this would mean cancelling some patient information → size decrease
- We can impute missing data from an independent dataset: we have a population from one hospital with some missing values, we go to another hospital, collect another group of patients there, and replace the missing data using that independent population; instead of computing the mean of our own population, we compute the average on the independent population and extract replacement values from it → independent dataset
- We can use a machine learning algorithm that tolerates missing data → model limit
- Otherwise, we can impute the missing data from the complete ones that we have in our own population.
The question now is how we can impute these data and how the imputation affects our results.

Aims:
1. To explore the effect of imputation on the classification performance of the model: they try different methods for imputing the missing data and compare the classification performance obtained with each method.
2. To determine whether it is better to rely on a smaller dataset with no imputed values (cancelling every patient that has a missing value) or to use a larger dataset in which the missing data are somehow imputed.

Data
Features: clinical features related to the grade of the tumor and the respiratory volume, plus basic variables such as gender, age and tumor volume. The output is the 2-year survival status, so a binary output (survived or not). There are 269 patient records, of which only 108 are complete while the other 161 have some missing values.

Results
On the x axis we have the percentage of missing data, on the y axis the average absolute error. For this part of the analysis they take the data that they know, remove a certain percentage of values and try to impute them, computing the difference between the imputed value and the correct one. The different lines show the same behavior, with the exception of the first one, which has a very large error; the lowest error is obtained with the mean approach, i.e., imputing the missing values with the average values.
Then there are the results for the classifier built on the complete data: they use the complete data to build a survival model with different ML algorithms, and the area under the ROC curve ranges from 0.61 to 0.72. Let's see what happens if we remove different percentages of data and build the classifier again: with the complete data we have 0.72, and this decreases when we remove data. The results seem to drop dramatically, but if we look closely at the y axis we see that the differences between the values are actually quite small; so, even if we remove, say, 30% of the data and replace them with any of these methods, the result is very similar.
In another test they train the model on the bigger set of patients (the 161 with missing values, after imputation) and test it on the remaining 108 patients; again, they find a maximum AUC value of 0.68.

Conclusions
1. They did not find any relationship between the error in estimating the missing values and the prediction performance of the model: independently of the imputation method, the model can be trained quite well.
2. The complete dataset is the best way to build the prediction model.
3. The ability to obtain similar models in terms of AUC depends on the ML model used.
It is nice that they compare the different imputation methods and show that, in this case, each of them performs in a similar way.

Another example
Introduction: there are two main problems related to tissue micro-arrays (TMA). The first is that they are very likely to have missing data; moreover, the missingness mechanism of tissue micro-arrays is such that the data cannot be considered missing completely at random, because the values tend to be correlated across variables.

Materials and statistical analysis
The first thing they did was a statistical analysis of the data, to see what kind of missing data they were dealing with:
1. They compared the prognosis of patients with and without missing values: if the missing values are missing completely at random, there should be no differences between the two groups.
2. They computed the correlation between each pair of variables: if the data are MCAR, the variables should be uncorrelated.

Imputation methods
1. Mean substitution (MS)
2. Iterative multivariable regression technique
For the survival analysis they performed a Cox regression analysis, used when we have, for example, time on the x axis and survival on the y axis: the curve starts at 1 and then decreases as people die, and at some point the curves of the two groups separate, one of them describing the group at lower risk; we can look for the variables that describe this separation. They performed this analysis for different models:
1. The dataset with complete data only
2. The dataset in which the missing values were replaced by the average
3. Multiple iterative regression at two different levels: MI−
4. MI+ → here they not only imputed the missing expression levels but also imputed the survival time. In the population there are missing values for the gene expression values, but also some missing values for the survival time (patients who never came to the follow-up visit or never answered the phone), and they decided whether or not to impute these values as well.

Simulations
To make the results more robust, they performed some simulations: they built datasets with missing values by removing some data from the complete dataset in such a way that the data could be considered MCAR (100 datasets) or MAR (100 datasets). They repeated this 1000 times; basically, they have the complete dataset, which is smaller (5443 subjects with no missing data) than the whole dataset (more than 11,000 patients), and from these data they extract 100 datasets considering the data MCAR or MAR.

Results
All variables were used in the model and all of them have some missing values. Of the 78,484 (11,212 × 7 variables) possible data points, 13,357 were missing (6%) in 5,769 patients (51%).
Let's look at these curves: time is on the x axis and survival probability on the y axis, decreasing over time. Subjects with and without missing data have very different survival probabilities, and the difference is statistically significant: if the data were MCAR, the two curves should not differ, but this is not the case. If we excluded patients with missing data (dashed line), we would only keep the patients of the solid-line group and would underestimate the overall survival time of our population, because patients with missing data survive longer than the others; excluding them would therefore introduce some bias, because we would underestimate the survival time of the population.
Not only that: looking at the correlations between the different variables, we can see that there are some points where the correlation is quite high, which means the data are not MCAR; so, performing a complete-case analysis would introduce some bias.
Here we can see some parameters derived from the Cox analysis, with the true value shown as the horizontal line, for the different Cox models (complete-case analysis CCA, imputation with the mean, imputation with the new methods). If we consider the population with MCAR data, the variability is much higher (first figure), and this is not good; it is related to the fact that some data are correlated and this was not taken into account.

Conclusions
The results were very similar, even if the CCA was less precise. They concluded that the proposed algorithm (MI+) is the best.

How to recognize handwritten digits - human
Consider this sequence of digits. The human visual system effortlessly recognizes these digits as 504192, because our head is a kind of supercomputer that has been trained and has evolved over time. But it is not always easy to recognize letters and digits: if, for example, we write a program to tell a computer how to recognize digits, this is not simple, because it needs explicit rules. There is a lot of variability in handwritten digits, and we need to take it into account: for example, a 9 has a loop at the top and a vertical stroke in the bottom right, but this is not easy to express as a rule inside an algorithm. So, let's introduce artificial neural networks.

How to recognize handwritten digits - neural network
Neural networks approach this problem in a very different way. The idea is not to define rules: we take a very large number of handwritten digits as a training set, and the neural network tries to learn from these examples how to recognize the different digits. So, we can say that artificial neural networks automatically infer the rules for recognizing handwritten digits from the examples, and the more examples we provide, the better they can learn and improve their accuracy.

Biological neurons
A biological neuron receives many inputs through its dendrites (information coming from other neurons); it somehow summarizes the information coming from the different sources in the cell body, and then this information is translated into another kind of signal, the action potential, which is sent to the following neuron through the axon. The human brain has billions of neurons. Neurons convert the complex pattern they receive as input into a simple decision: to spike or not. This simple action has influenced research in this field: each neuron performs a very simple cognitive function, it reduces the complexity of all its inputs by somehow categorizing them into different patterns, and in the end it decides whether to spike or not. Inspired by this intuition, researchers decided to build neural networks: a neural network model is made of units that combine multiple inputs into a single output.

Artificial neuron
The first model of an artificial neuron was this one. At the center we have the equivalent of the soma, the cell body; then we have the inputs, i.e., the information arriving from the dendrites; and then we have the output information, going through the axon to the next neuron. In this case we have n inputs, each weighted by a coefficient w that describes the importance of that input. Inside the soma part there is some function.
First, all the weighted inputs are summed, and then they go into another block, a function that somehow decides whether the neuron fires or not. In the case of a binary output, the output depends on the size of the weighted sum in the soma, compared against a certain threshold.

Example: cheese festival
Suppose that you want to go to the cheese festival in Bra. Before deciding to go, you have to take different aspects into consideration. These are the three main points:
1. Is the weather good?
2. Does your friend want to accompany you?
3. Is the festival near public transit?
Each of these questions can be answered with yes or no, so we can model them as three binary input variables. We then need to weight these inputs to decide whether to go or not. Imagine that you really adore this festival and that the only thing that could stop you from going is bad weather. So, the fact that your friend comes along or that the festival is near public transit matters much less: the weather input should receive a much larger weight than the other two.
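A minimal sketch of the threshold neuron described above, applied to the festival decision; the weights and the threshold are made up so that only bad weather can keep the output at 0.

```python
def artificial_neuron(inputs, weights, threshold):
    """Binary threshold unit: fire (1) if the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Cheese-festival decision with made-up weights: the weather dominates the decision.
# inputs = (good weather, friend comes along, festival near public transit)
weights = (6, 2, 2)
threshold = 5
print(artificial_neuron((1, 0, 0), weights, threshold))   # good weather alone -> go (1)
print(artificial_neuron((0, 1, 1), weights, threshold))   # bad weather -> stay home (0)
```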