Data Science Preprocessing Steps Prior To ML – EP20 by Karan Bhandari

June 27, 2021

All Data Science Preprocessing steps

Refer to https://github.com/kurtzace/Diary2021/issues/5 for code snippets related to this podcast.

AML Intensity project (text preprocessing) https://github.com/kurtzace/AML-Intensity

Approximate Transcript

This is a technology podcast, and you are listening to Karan Bhandari. In this episode we will be covering data science preprocessing steps: how to clean and manage your data so that you have the right inputs for your machine learning models.

First, we expect you to eyeball the data using Excel, LibreOffice Calc, or pandas. Take a look at what the data looks like and check how data entry was done. For example, people tend to put in entries like "no clue", "unknown", "not available", or "NA", so take a look at the data entry patterns. Beyond that, look for comments in the data. Some people write "refer to this URL", and some people leave to-do instructions or pending items. Then take a look at generic fields like name, place, animal, and thing. See whether you spot odd things like numeric names, extremely vague country codes, or something that defies logic. For example, engines rarely go beyond V8 or V12, so if you see something like 48 or 99, treat it as a likely data entry error. Some people put 99 or 999 to stand for infinity because they can't type the ∞ symbol. And of course, try your best to remove personally identifiable data so that you don't run into GDPR issues: look for name, age, SSN, date of birth, and blood group.

To load your datasets into RAM, I recommend pandas. Pandas is an open source library that can read CSV and Excel files. If you would rather eyeball the data in an online portal, Google BigQuery is a very good viewer as well: you can import files in Avro, CSV, or JSON and see what the file looks like. If you are using pandas, you do import pandas as pd and then call pd.read_csv. Type df.head() to take a look.
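As a rough illustration of the loading step described in the transcript, here is a minimal sketch. The inline CSV and its column names are made up for the example; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
raw_csv = io.StringIO(
    "name,age,engine\n"
    "Asha,34,V8\n"
    "Ben,unknown,V12\n"
    "Chen,no clue,99\n"
)

df = pd.read_csv(raw_csv)

print(df.head())     # first five rows (here, all three)
print(df.columns)    # column labels
print(df.dtypes)     # "age" loads as text because of the free-form entries
```

Note how the free-form entries force the age column to load as text, which is one quick way to spot data entry problems.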
Take a look at the columns, which you can access directly with df.columns. When you call df.head(), it will show you the first five rows and the entries in them. You can also specify NA values up front: the pandas read_csv method accepts an na_values parameter, so if your data entry uses markers beyond the generic ones, you can add them there.

After you have loaded the data, and before you do any further processing, take a look at df.info(). That gives you the count of non-null values per column and the total row count. As a rough rule of thumb, if it's under 50,000 rows you can process it on one machine. If it goes beyond that, you may have to think of using Google BigQuery, Redshift, a SQL-based data store, or Azure Data Lake, or maybe set up a multi-machine environment with Spark. But let's not get into that.

Assuming you have a reasonable number of rows, you can type df.isnull() to see where the nulls are, and df.isnull().sum() to see the total count of nulls per column. Sometimes people have the habit of putting a zero against things like glucose, blood pressure, skin thickness, or BMI. You know these things cannot be 0, so you can go ahead and treat them as null: select the list of affected columns and replace 0 with np.nan, where np comes from importing NumPy.

Then use your domain knowledge. As I mentioned with the engine count, the weights of people, or the age of people, you know the values will fall within a specific range. Type df.describe() and you will be able to see the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median), which tells you where most of the data lies.
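The inspection steps above could look roughly like this. It is a sketch with invented column names and values, loosely modelled on the glucose and blood pressure example; the extra missing-value markers are whatever your own data entry happens to use.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: zeros and free-text markers stand in for missing values.
raw_csv = io.StringIO(
    "glucose,blood_pressure,bmi,notes\n"
    "148,72,33.6,ok\n"
    "0,66,26.6,no clue\n"
    "183,0,23.3,unknown\n"
    "89,66,0,ok\n"
)

# Extra markers to treat as missing, on top of the pandas defaults.
df = pd.read_csv(raw_csv, na_values=["no clue", "unknown"])

df.info()                   # dtypes, non-null counts, row count
print(df.isnull().sum())    # nulls per column before fixing the zeros

# Zero is not a plausible reading for these columns, so treat it as missing.
measure_cols = ["glucose", "blood_pressure", "bmi"]
df[measure_cols] = df[measure_cols].replace(0, np.nan)

null_counts = df.isnull().sum()
print(null_counts)
print(df.describe())        # count, mean, std, min, quartiles, max
```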
Check whether the data looks normalized to you. Sometimes you have to scale the data before you feed it into an ML model, because columns come on very different scales. For example, the number of bedrooms may be 2, 3, or 4, but the area will be something like 2,000 or 5,000 square feet. Both are just features, but the machine learning algorithm may feel that the area is more significant simply because its values are larger. So try scikit-learn's preprocessing MinMaxScaler. You can give it a range of 0 to 1 and run fit_transform, and the values will be rescaled between 0 and 1. Your number of bedrooms will then fall between 0 and 1, even if there are 10 bedrooms, and the price or the square footage will also be mapped to values between 0 and 1. You could also use StandardScaler instead of MinMaxScaler. It serves a similar purpose, but it rescales each column to zero mean and unit variance rather than to a fixed range.

Sometimes several columns in your data set carry the same meaning. For example, I think a complete data science course I took mentioned that religion, voting history, participation in associations, and upbringing may all reflect your attitude towards immigration, so you can merge them into a single column.

Sometimes you are analyzing something related to housing, but unrelated information, say about particles, creeps into your data and you know it is not correlated. Or something like a blood parameter comes into your data set, or even an ID column. A SQL ID contributes nothing to your data science model. It may just confuse the model, which could treat the ID as a significant factor in predicting the outcome. So remove uncorrelated data, and that way you ensure it is not creeping in. You can also study this with statistics.
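A small sketch of the scaling step, assuming scikit-learn is installed. The housing numbers and column names are invented for illustration, and the ID column is dropped first for the reason given in the transcript.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical housing data; "id" carries no predictive meaning.
homes = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "bedrooms": [2, 3, 4, 10],
    "area_sqft": [2000, 3500, 5000, 8000],
})

features = homes.drop(columns=["id"])

# Rescale every column into the range 0 to 1.
minmax = MinMaxScaler(feature_range=(0, 1))
scaled_01 = pd.DataFrame(minmax.fit_transform(features), columns=features.columns)

# Alternative: zero mean and unit variance per column (not bounded to 0..1).
standard = StandardScaler()
scaled_z = pd.DataFrame(standard.fit_transform(features), columns=features.columns)

print(scaled_01)
print(scaled_z)
```

In a real project you would usually fit the scaler on the training split only and reuse it to transform the test split.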
If you are doing something like linear regression or logistic regression with statsmodels, you will be able to study the p-values and even the F-statistics. If the p-value is below 0.05, the column is considered significant. There is also something called the variance inflation factor, which also comes from statsmodels. I'll talk about that very soon. I think I'm jumping the gun here, so let's go back to preprocessing, and then we will head towards p-values and the stats-related information.

Sometimes you have to convert columns that have between 2 and 10 unique values. For example, there may be a few categories such as veg and non-veg, or carnivorous, herbivorous, and omnivorous. With three categories like these, you can convert them to one-hot encoded format, where each category is represented like a binary number: 001, 010, or 100. For that you can use pd.get_dummies, or scikit-learn's preprocessing OneHotEncoder.

There is also something called LabelEncoder, which maps each category to an integer. That is useful when categories have a rank. For example, if gold ranks first, silver second, and bronze third, you can represent them as 1, 2, and 3. Note that LabelEncoder assigns integers in sorted order, so for a specific ranking you may need an explicit mapping. Either way, it's not recommended to feed raw strings like gold, platinum, and silver into a machine learning model. There is one more, lesser-used option called Binarizer in scikit-learn, but it's not used very often.

How do you deal with large ranges? BMI values, for example, may be spread over a wide range like 20 to 40. You could say that a BMI of less than 18 is underweight, 18 to 25 is normal, 25 to 30 is overweight, and 30 to 40 is obese. You can take that approach to reduce the number of categories.
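The three encoding ideas above (one-hot encoding, ranked categories, and bucketing a numeric range) can be sketched as follows. The data is invented, the explicit dictionary is one way to control the rank order, and pd.cut is my choice for the bucketing step rather than something named in the episode.

```python
import pandas as pd

df = pd.DataFrame({
    "diet": ["herbivorous", "carnivorous", "omnivorous", "herbivorous"],
    "medal": ["gold", "bronze", "silver", "gold"],
    "bmi": [17.0, 22.5, 27.0, 34.0],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["diet"], prefix="diet")

# Ranked categories: an explicit mapping keeps gold=1, silver=2, bronze=3.
medal_rank = {"gold": 1, "silver": 2, "bronze": 3}
df["medal_rank"] = df["medal"].map(medal_rank)

# Bucketing: left-closed bins [0, 18), [18, 25), [25, 30), [30, 40).
bmi_bins = [0, 18, 25, 30, 40]
bmi_labels = ["underweight", "normal", "overweight", "obese"]
df["bmi_band"] = pd.cut(df["bmi"], bins=bmi_bins, labels=bmi_labels, right=False)

print(one_hot)
print(df)
```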
You can use df.loc to take a look at what the ranges are and create new data columns, so that you are sure you have reduced the number of categories.

Now, before I get into image preprocessing, I would like to talk about a few other things. There are a lot of statistical measures you should acquaint yourself with. There is the mean, which is the average. Then there is the median, which is the middle value when the data is sorted. For example, the median height of adults around the world is somewhere between 5 and 6 feet. The mode is the most frequently occurring value.

Then there is something called the variance inflation factor, or VIF. It estimates how much the variance of a coefficient is inflated because of linear dependence with other predictors. A VIF of 1 means the predictor is not correlated with the others. If the VIF is between 1 and 5, it's perfectly OK. Between 5 and 10 is borderline, and a VIF exceeding 10 is a cause for concern. For example, if mileage and engine capacity largely explain each other, that is where VIF comes into the picture: it flags columns that duplicate information already carried by other columns. So try to see whether your VIF is between 1 and 5. You can use the statsmodels VIF function for this.

Separately, check which values are outliers. For example, you may find that an extremely skilled person like Michael Jordan is skewing your basketball skill measurement. Sometimes you get extremely odd values because one extremely rich individual purchased an entire city. These are outliers, and they are not representative.
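To make the VIF thresholds concrete, here is a NumPy-only sketch that computes VIF by hand as 1 / (1 - R²), regressing each column on the others. The transcript refers to the statsmodels implementation (variance_inflation_factor in statsmodels.stats.outliers_influence); this hand-rolled version is a stand-in so the arithmetic is visible. The car data is synthetic and the helper name vif_scores is hypothetical.

```python
import numpy as np


def vif_scores(X):
    """Return one VIF per column: 1 / (1 - R^2) from regressing it on the rest."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Include an intercept so R^2 is measured around the mean.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        scores.append(1.0 / (1.0 - r_squared))
    return scores


# Synthetic data: mileage is almost a linear function of engine capacity,
# while the number of doors is unrelated to both.
rng = np.random.default_rng(0)
engine_cc = rng.normal(2000, 400, size=200)
mileage = 30 - 0.008 * engine_cc + rng.normal(0, 0.3, size=200)
doors = rng.integers(2, 6, size=200).astype(float)

vifs = vif_scores(np.column_stack([engine_cc, mileage, doors]))
for name, score in zip(["engine_cc", "mileage", "doors"], vifs):
    print(f"{name}: VIF = {score:.2f}")
```

In this setup the two linked columns should land well above 10, while the unrelated column stays close to 1.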
Outliers like these can skew your entire analysis by a very large margin.

It's also good to create checkpoints while you process your data, because you will be dropping things. You can use df.drop to drop unnecessary columns, but I suggest making a copy first: df.copy() gives you a new copy of your data set before you do something very different, so you can come back to these checkpoints. And I forgot to tell you earlier: take a look at the unique values in a column. The unique() method can help you judge some of the criteria I mentioned before.

Working with time is tricky. You have to deal with different formats, so it is good to use pd.to_datetime. You can also create groups by using loc. For example, you can say values from 1 to 10 belong to group one, 10 to 15 to the next group, and 15 to 20 to the group after that. Converting to groups helps if you feel there are too many one-hot encoded values.

You can use np.where to check whether your data exceeds the median() value and store the result in a new column. Some people even balance the data set by using techniques like basic averaging.

You may also need to apply a function to every entry. For example, imagine you are going through a list of companies and you want a description of each company from a Yahoo API. You can take the column and call apply on it, and inside the function use Python's requests module to call the Yahoo information API and get a one-line description of every company, or maybe the current stock price.
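The checkpointing and column-derivation steps above can be sketched like this. The company data is invented, and describe_company is a hypothetical placeholder for the web lookup mentioned in the transcript; no real API is called.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "company": ["Acme", "Globex", "Initech", "Acme"],
    "joined": ["2021-01-15", "2021-02-20", "2021-03-05", "2021-06-27"],
    "revenue": [120, 80, 200, 95],
    "internal_id": [1, 2, 3, 4],
})

# Checkpoint: keep an untouched copy before dropping anything.
checkpoint = df.copy()
df = df.drop(columns=["internal_id"])

# Quick look at the distinct values in a column.
unique_companies = df["company"].unique()

# Parse date strings into proper datetime values.
df["joined"] = pd.to_datetime(df["joined"])

# New column: 1 where revenue exceeds the median, else 0.
df["above_median"] = np.where(df["revenue"] > df["revenue"].median(), 1, 0)


def describe_company(name):
    # Hypothetical stand-in: a real version might call a web API with `requests`.
    return f"{name} (description placeholder)"


# Apply a function to every entry of a column.
df["description"] = df["company"].apply(describe_company)

print(df)
```

For a real lookup it is usually worth caching results per unique company, since apply would otherwise repeat the same request for duplicate rows.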
That way you can preprocess your data set, or for instance fetch the ranking of each company. Augmenting your data set like this is very healthy.

I was saying something about the p-value earlier. The p-value tells you how much evidence you have in hypothesis testing. It is used to check your null hypothesis against the alternative, or in other words to challenge the status quo.

As I mentioned, the p-value can help you find out whether something is adding noise to your data set. For example, if I generate random values and add them as a new column, that column adds a lot of noise, so its p-value will typically be high. A value of 0.1 or more is considered high, and it means you lack evidence for the hypothesis that this column is actually helping your linear, logistic, or other machine learning model. If a column has a very low p-value, that is, less than 0.05, the column is likely helpful for prediction. statsmodels can give you the p-value, and there is also the f_regression function that comes with scikit-learn, which can help you find out whether a particular column is significant for prediction. So if a column looks uncorrelated and its p-value comes out around 0.1, you do not have evidence that it is helping in any significant way, and you can consider dropping it.

Now, images are quite different. You have to treat them in a different way.
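Here is a minimal sketch of the "random noise column" experiment described above, using scikit-learn's f_regression. The data is synthetic: one column genuinely drives the target and the other is pure noise. Keep in mind that f_regression tests each column on its own, which is a simplification compared with the p-values of a full statsmodels regression.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
n = 200

signal = rng.normal(size=n)                        # genuinely drives the target
noise = rng.normal(size=n)                         # unrelated random column
y = 3.0 * signal + rng.normal(scale=0.5, size=n)   # target

X = np.column_stack([signal, noise])
f_stats, p_values = f_regression(X, y)

for name, f_stat, p_value in zip(["signal", "noise"], f_stats, p_values):
    print(f"{name}: F = {f_stat:.2f}, p = {p_value:.4g}")
```

The signal column should come out with a p-value far below 0.05, while the noise column's p-value is simply whatever chance produces for this sample.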
If you are dealing with documents, it is often better to run OCR on them with the help of Microsoft Cognitive Services, AWS Textract, or Google's GCP Vision offering. Then you can perform NER on top of the extracted text. I'll come to NER after the image section is done.

When you get images, you sometimes have to fix the alignment, fix pixel values, or convert the images to grayscale or black and white. You need to ensure that they are not heavily transformed.

Sometimes you are OK with just the edges of the images. If you have too much noise, for example if you scan an entire ID card, I don't think object detection algorithms are powerful enough to differentiate between the name and the address. Sometimes you are only interested in extracting the face, and then you can help the algorithm by removing features that look different. You can use histograms to compare image features.

You can use OpenCV to open, view, resize, and transform images, or to do basic things like edge detection and corner detection. There are a lot of image preprocessing techniques, such as noise removal. I think resizing is very important, otherwise you may run out of RAM.

When it comes to text, the NLTK toolkit and word2vec are very effective. NER is named entity recognition. I'm using the term NER, but what I'm actually talking about is text processing in general, because you can analyze everything.
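The grayscale, resize, and black-and-white steps above can be illustrated with plain NumPy. This is a stand-in, not the OpenCV workflow the transcript recommends: a real pipeline would normally load the image from disk and use a proper resize routine. The random array below is only a placeholder for an actual RGB image, and the luminance weights are the common 0.299 / 0.587 / 0.114 convention.

```python
import numpy as np

# Placeholder "image": height 64, width 48, 3 colour channels, values 0..255.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)

# Grayscale: weighted sum of the R, G, B channels.
weights = np.array([0.299, 0.587, 0.114])
gray = image.astype(np.float64) @ weights      # shape (64, 48)

# Crude resize: keep every second pixel in each direction.
small = gray[::2, ::2]                         # shape (32, 24)

# Fix pixel values into the range 0..1.
normalized = small / 255.0

# Black and white: threshold at the midpoint.
black_white = (normalized > 0.5).astype(np.uint8)

print(gray.shape, small.shape, black_white.dtype)
```

Halving each dimension cuts memory use to a quarter, which is why resizing early helps when RAM is tight.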
Suppose you were asked to work on a news data set and to find the most significant topic or the most significant variable in it. You need to remove punctuation, remove stop words like "a", and remove anything else that could cause ambiguity. Some people also do stemming and lemmatization. With lemmatization, words like good, better, and best can be converted to something like "good", so you have fewer words to deal with. Stemming and lemmatization convert superlative and derived forms, such as bravest or fearless, to base forms like brave or fear. That is done with the help of a stemmer.

You can stitch all of this into one pipeline to preprocess your text. That reduces the amount of text going in and can also help with frequency counting.

There is a very good technique called word2vec that can help you measure how closely words are related. For example, king and queen are similar to each other, but they differ along the male and female direction. If you do king minus queen, the result may be something close to male. You can do this kind of arithmetic, subtraction and addition, with words. We know that ban and restriction are very close to embargo, and that computer and mouse are closely related compared to something very different like god. You can use Google's pretrained word2vec models to get the similarity score of one word against another, and that can help with word counting. For example, the show notes link to a project called AML Intensity, where I count the intensity of words such as embargo, blacklist, OFAC order, and sanction against a particular company, and that helps calculate the intensity.
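A bare-bones version of the text pipeline described above (lowercase, strip punctuation, remove stop words, count frequencies) can be written with the standard library alone. The stop-word list here is a tiny hand-picked sample and the headline is invented; a library such as NLTK ships fuller stop-word lists plus stemmers and lemmatizers, which this sketch leaves out.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "on", "of", "and", "is", "was", "to", "in"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in STOP_WORDS]


headline = "The embargo on the company was extended, and a new embargo is planned."
tokens = preprocess(headline)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(3))
```

Counting how often terms like "embargo" survive the cleaning step is the same basic idea as the intensity counting mentioned for the AML Intensity project, though that project's actual method may differ.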
Take a look at the AML Intensity project and you will be able to perform certain text-related tasks. The most important one is NER, named entity recognition, where you find out which token looks like a company, which looks like a date, and which looks like a number. Finding out which word is a noun or a pronoun is called parts of speech tagging. You can use the spaCy library for both.

Image and text data preprocessing is an extremely vast field. As a data analyst, you may spend most of your time on it, because real-world data is not very clean. Of course, if you download some of the popular datasets like MNIST, the diabetes data set, WHO data, or data from the Census Bureau, they are already extremely clean. Real-world data is not. You may need to spend some time getting your head around it, cleaning it, and processing it, and you may have to revisit it. It's not something you do once and forget about. Keep persisting your work: save it in NPZ format, CSV format, or Excel format after you are done, and maybe even save the intermediate steps, because you will have to revisit them.

I wish you all the best in your data preprocessing tasks. This is Karan Bhandari signing out. I work as a developer at Société Générale. You can contact me on Twitter or on LinkedIn, where I'm known as Kurtzace. Goodbye.
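As a sketch of the "keep persisting it" advice, the snippet below saves a cleaned table to CSV and NPZ and reads both back. The data and file names are made up, and a temporary folder is used so nothing is left behind. Excel output (DataFrame.to_excel) is omitted because it needs an extra engine package installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical cleaned data to persist.
cleaned = pd.DataFrame({"glucose": [148.0, 183.0], "bmi": [33.6, 23.3]})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "cleaned.csv")
    npz_path = os.path.join(folder, "cleaned.npz")

    # CSV: human-readable and easy to open in a spreadsheet.
    cleaned.to_csv(csv_path, index=False)

    # NPZ: compact NumPy archive holding the values and the column names.
    np.savez(
        npz_path,
        values=cleaned.to_numpy(),
        columns=np.array(list(cleaned.columns), dtype=str),
    )

    # Read both back to confirm the round trip.
    from_csv = pd.read_csv(csv_path)
    with np.load(npz_path) as archive:
        from_npz = pd.DataFrame(archive["values"], columns=archive["columns"])

print(from_csv)
print(from_npz)
```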


Kurtzace