Skip to content

Instantly share code, notes, and snippets.

@alancriaxyz
Created July 12, 2024 00:38
Show Gist options
  • Save alancriaxyz/1dff09dd2d3cc76b91d07912ba358f71 to your computer and use it in GitHub Desktop.
Save alancriaxyz/1dff09dd2d3cc76b91d07912ba358f71 to your computer and use it in GitHub Desktop.

Video:

https://www.youtube.com/watch?v=bmmQA8A-yUA

Texto:

This machine learning course is created for beginners who are learning in 2024. The course begins with a machine learning roadmap for 2024, emphasizing career paths and beginner-friendly theory. Then the course moves on to hands-on practical applications and a comprehensive end-to-end project using Python. Todd have created this course. She is an experienced data science professional. Her aim is to demystify machine learning concepts, making them accessible and actionable for newcomers and to bridge the gap in existing educational resources, setting you on a path to success in the evolving field of machine learning. Looking to step into machine learning or data science? It's about starting somewhere practical yet powerful. In this introductory course, Machine Learning for Beginners, we're going to cover the basics of machine learning and we're going to put that into practice by implementing it in a real-world case study. I'm Tadhe Vasayan, co-founder of LunarTech, where we are making data science and AI more accessible for individuals and businesses. If you're looking for machine learning, deep learning, data science, or AI resources, then check out the free resources section in LunarTech.ai or our YouTube channel where you can find more content and you can dive into machine learning and in AI. We're going to start with machine learning roadmap, where in this detailed section, we're going to discuss the exact skillset that you need to get into machine learning. We're also going to cover the definition of machine learning, what is a common career path, and a lot of resources that you can use in order to get into machine learning. Then we're going to start with the actual theory. We're going to touch base, the basics. We're going to learn what are those different fundamentals in machine learning. Once we have learned the theory and we have also looked into the machine learning roadmap, we're going to put our theory into practice. We're going to conduct an end-to-end, a basic yet powerful case study where we're going to implement the linear regression model. We're going to use it both for causal analysis and for predictive analytics for Californian house prices. We're going to find out the features that drive the Californian house values. And we're going to discuss the step-by-step approach for conducting a real-world data science project. At the end of this course, you are going to know the exact machine learning roadmap for 2024, what are the exact skillset and the action plan that you can use to get into machine learning and in data science. You are going to learn the basics when it comes to machine learning. You are going to implement it into actual machine learning project, end-to-end, including implementing Pandas, NumPy, Scikit-learn, TouchModels, Metaflip, and Seaborn in Python for a real-world data science project. Dive into machine learning with us. Start simple, start strong. Let's get started. Hi there. In this video, we are going to talk about how you can get into machine learning in 2024. First, we are going to start with all the skills that you need in order to get into machine learning. Step by step, what are the topics that you need to cover and what are the topics that you need to study in order to get into machine learning? We are going to talk about what is machine learning. Then we are going to cover step-by-step what are the exact topics and the skills that you need in order to become a machine learning researcher or just get into machine learning. 
Then we're going to cover the type of exact projects you can complete. So examples of portfolio projects in order to put it on your resume and to start to apply for machine learning related jobs. And then we are going to also talk about type of industries that you can get into once you have all the skills and you want to get into machine learning, so the exact career path and what kind of business titles are usually related to machine learning. We are also going to talk about the average salary that you can expect for each of those different machine learning related positions. At the end of this video, you are going to know what exactly machine learning is, where is it used, what kind of skills are there that you need in order to get into machine learning in 2024, and what kind of career path with what kind of compensation you can expect with the corresponding business titles when you want to start your career in machine learning. So we will first start with the definition of machine learning, what machine learning is and what are the different sorts of applications of machine learning that you most likely have heard of but you didn't know that it was based on machine learning. So what is machine learning? Machine learning is a brand of artificial intelligence, of AI, that helps you build models based on the data and then learn from this data in order to make different decisions. And it's being used across different industries, starting from healthcare till entertainment in order to improve the customer experience, identify customer behavior, improve the sales for the businesses, and it also helps governments to make decisions. So it really has a wide range of applications. So let's start with the healthcare. For instance, machine learning is being used in the healthcare to help with the diagnosis of diseases. It can help to diagnose cancer. During the COVID, it helped many hospitals to identify whether people are getting more severe side effects or they are getting pneumonia based on those pictures. And that was all based on machine learning and specifically computer vision. In the healthcare, it's also being used for drug discovery. It's being used for personalized medicine, for personalizing treatment plans, to improve the operations of the hospitals to understand what is the amount of people and patients that hospital can expect in each of those days per week, and also to estimate the amount of doctors that need to be available, the amount of people that the hospital can expect in the emergency room based on the day or the time of the day. And this is basically not a machine learning application. Then we have machine learning in finance. Machine learning is being largely used in finance for different applications, starting from fraud detection and credit cards or in other sorts of banking operations. It's also being used in trading, specifically in combination with quantitative finance, to help traders to make decisions where they need to go short or long into different stocks or bonds or different assets just in general to estimate the price that those stocks will have in assets in real time in a most accurate way. It's also being used in retail. It helps to understand and estimate the demand for certain products in certain warehouses. It also helps to understand what is the most appropriate or closest warehouses that the items for that corresponding customer should be shipped. So it's optimizing the operations. 
It's also being used to build different recommender systems and search engines like the infamous Amazon is doing. So every time when you go to Amazon and you are searching for a project or product, you will most likely see many article recommenders. And that's based on machine learning. Because Amazon is gathering the data and comparing your behavior. So based on what you have bought, based on what you are searching to other customers. and those items to other items in order to understand what are the items that you will most likely will be interested in and eventually will buy it. And that's exactly based on machine learning and specifically different source of recommender system algorithm. And then we have marketing where machine learning is being heavily used because this can help to understand what are these different tactics and specific targeting groups that you belong and how retailers can target you in order to reduce their marketing costs and to result in high conversion rates. So to ensure that you buy their product. Then we have machine learning in autonomous vehicles. That's based on machine learning and specifically deep learning applications. And then we have also natural language preprocessing which is highly related to the famous chat GPT. I'm sure you are using it and that's based on the machine learning and specifically the large language models. So the transformers large language models where you will go in and providing your text and then question and the chat GPT will provide answers to you or in fact any other virtual assistant or chatbots. Those are all based on machine learning. And then we have also smart home devices. So Alexa is based on machine learning also in agriculture. Machine learning is being used heavily these days to estimate what the weather conditions will be. To understand what will be the production of different plants. What will be the outcome of this. To understand and to make decisions. Also how they can optimize those crop yields to monitor soil health and for different sorts of applications that can just in general improve the revenue for the farmers. Then we have of course in the entertainment. So the vivid example is Netflix that uses the data that you are providing related to the movies and also based on what kind of movies you are watching. Netflix is building this super smart recommender system to recommend you movies that you most likely will be interested in and you will also like it. So in all this machine learning is being used and it's actually super powerful topic and super powerful field to get into. And in the upcoming 10 years this is only going to grow. So if you have made that decision or you are about to make that decision to get into machine learning. Continue watching this video because I'm going to tell you exactly what kind of skills you need and what kind of practice projects you can complete in order to get into machine learning in 2024. So you first need to start with mathematics. You also need to know Python. You also need to know statistics. You will need to know machine learning and you will need to know some NLP to get into machine learning. So let's now unpack each of those skill sets. So independent the type of machine learning you are going to do. You need to know mathematics and specifically you need to know linear algebra. So you need to know what is matrix multiplication, what are the vectors, matrices, dot product. 
You need to know how you can multiply those different matrices, matrix with vectors, what are these different rules, the dimensions. Also what does it mean to transform a matrix, the inverse of the matrix, identity matrix, diagonal matrix. Those are all concepts as part of linear algebra that you need to know as part of your mathematical skill set in order to understand those different machine learning algorithms. Then as part of your mathematics you also need to know calculus. and specifically differential theory. So you need to know these different theorems such as chain rule, the rule of differentiating when you have sum of instances, when you have constant multiply with an instance, when you have sum but also subtraction, division, multiplication of two items and then you need to take the derivative of that. What is this idea of derivative? What is the idea of partial derivative? What is the idea of hessian? So first order derivative, second order derivative and it would be also great to know a basic integration theory. So we have differentiation and the opposite of it is integration theory. So this is kind of basic. You don't need to know too much when it comes to calculus but those are basic things that you need to know in order to succeed in machine learning. Then the next concepts such as discrete mathematics. So you need to know what is this idea of graph theory, what are these combinations, combinators, what is this idea of complexity which is the important when you want to become a machine learning engineer because you need to understand what is this big O notation. So you need to understand what is this complexity of n squared, complexity of n, the complexity of n log n and about that you need to know some basic mathematics when it comes which comes from usually high school. So you need to know multiplication, division, you need to understand multiplying amounts which are within the parentheses, you need to understand different symbols that represent mathematical values, you need to know this idea of using x's, y's and then what is x squared, what is y squared, what is x to the power 3. So different exponents of the different variables. Then you need to know what is logarithm, what is logarithm at the base of 2, what is logarithm at the base of e and then at the base of 10. What is the idea of e, so what is the idea of pi, what is this idea of exponent logarithm and how does those transform when it comes to taking derivative of the logarithm, taking the derivative of the exponent. Those are all values and all topics that are actually quite basic. They might sound complicated but they are actually not. So if someone explains you clearly then you will definitely understand it from the first go. And for this to understand all those different mathematical concepts, so linear algebra, calculus, differential theory and then discrete mathematics and those different symbols, you need to go for instance and look for courses or youtube tutorials that are about basic mathematics for machine learning and AI. Don't go and look further, you can check for instance Khan Academy which is a quite favorite when it comes to learning math both for uni students and also for just people who want to learn mathematics and this will be your guide. Or you can check our resources at learnertech.ai because we are going also to provide these resources for you in case you want to learn mathematics for your machine learning journey. 
The next skill set that you need to gain in order to break into machine learning is the statistics. So you need to know this is a must statistics if you want to get into machine learning and in AI in general. So there are a few topics that you must study when it comes to statistics and those are descriptive statistics, multivariate statistics, inferential statistics, probability distribution and some by... thinking. So let's start with descriptive statistics. When it comes to descriptive statistics you need to know what is this idea of mean, median, standard deviation, variance and just in general how you can analyze the data with using this descriptive measures. So distance measures but also variational measures. Then the next topic area that you need to know as part of your statistical journey is the inferential statistics. So you need to know those infamous theories such as central limit theorem, the law of large numbers and how you can relate to this idea of population, sample, unbiased sample and also a hypothesis testing, confidence interval, statistical significance and how you can test different theories by using this idea of statistical significance, what is the power of the test, what is type 1 error, what is type 2 error. So this is super important for understanding different sorts of machine learning applications if you want to get into machine learning. Then you have probability distributions and this idea of probabilities. So to understand those different machine learning concepts you need to know what are probabilities. So what is this idea of probability, what is this idea of sample versus population, what is what does it mean to estimate probability, what are those different rules of probability. So conditional probability and those probability values and rules that usually you can apply when you have a probability of multipliers, probability of two sums and then you need to understand some popular and you need to know some popular probability distribution functions and those are Bernoulli distribution, binomial distribution, normal distribution, uniform distribution, exponential distribution. So those are all super important distributions that you need to know in order to understand this idea of normality, normalization. Also this idea of Bernoulli trials and relating different probability distributions to different higher level statistical concepts. So rolling a dice, the probability of it, how it is related to Bernoulli distribution or to binomial distribution and those are super important when it comes to hypothesis testing but also for many other machine learning applications. So then we have the Bayesian thinking. This is super important when it comes to more advanced machine learning but also some basic machine learning. You need to know what is the Bayes theorem which arguably is one of the most popular statistical theorems out there comparable also to the central limit theorem. You need to know what is conditional probability, what is this Bayes theorem and how does it relate to conditional probability, what is this Bayesian statistics idea at very high level. You don't need to know everything in super detailed but you need to know these concepts at least at high level in order to understand machine learning. So to learn statistics and the fundamental concepts of statistics you can check out the fundamentals to statistics course at lunatech.ai. 
Here you can learn all the strict wide concepts and topics and you can practice it in order to get into machine learning and to gain the statistical skills. The next skill set that you must know is the fundamentals to machine learning. So this covers not only the basics of machine learning but also the most popular machine learning algorithms. So you need to know this different mathematical side of this algorithm, step-by-step how they work, what are the benefits of them, what are the demerits, and which one to use for what type of applications. So you need to know this categorization of supervised versus unsupervised versus semi-supervised. Then you need to know what is this idea of classification, regression, or clustering. Then you need to know also time series analysis. You also need to know these different popular algorithms including linear regression, also logistic regression, LDA, so linear discriminant analysis. You need to know KNN. You need to know decision trees, both classification and regression case. You need to know random forest, bagging, but also boosting. So popular boosting algorithms like light GBM, GBM, so gradient boosting models. And you need to know XGBoost. You also need to know some unsupervised learning algorithms such as k-means, usually used for clustering. You need to know DBSCAN, which becomes more and more popular in clustering algorithms. You also need to know hierarchical clustering. And for all this type of models, you need to understand the idea behind them, what are the advantages and disadvantages, whether they can be applied for unsupervised versus supervised versus semi-supervised. You need to know whether they are for regression, classification, or for clustering. Beside of these popular algorithms and models, you also need to know the basics of training a machine learning model. So you need to know this process behind training, validating, and testing your machine learning algorithms. So you need to know what does it mean to perform hyperparameter tuning, what are those different optimization algorithms that can be used to optimize your parameters, such as GD, SGD, SGD with momentum, Adam, and Adam-V. You also need to know the testing process, this idea of splitting the data into train, validation, and then test. You need to know resampling techniques, why are they used, including the bootstrapping and cross validation, and there's different sorts of cross validation techniques, such as leave-one-out cross validation, k-fold cross validation, validation set approach. You also need to know this idea of matrix and how you can use different matrix to evaluate your machine learning models, such as classification type of metrics like F1 score, F beta, precision, recall, cross entropy, and also you need to know some matrix that can be used to evaluate regression type of problems, like the mean squared error, so MAC, root mean squared error, RMAC, MAA, so the absolute version of those different sorts of errors, and or the residual sum of squares. For all these cases, you not only need to know high level what those algorithms or those topics or concepts are doing, but you actually need to know the mathematics behind it, their benefits, the disadvantages, because during the interviews you can definitely expect questions that will test not only your high-level understanding, but also this background knowledge. 
If you want to learn machine learning and you want to gain those skills, then feel free to check out my Fundamentals to Machine Learning course at lunatech.ai, or you can also check out and download for free the Fundamentals to Machine Learning handbook that I published with FreeCodeCamp. Then the next skill set that you definitely need to gain is the knowledge in Python. Python is actually one of the most popular programming languages out there and it's being used across software engineers, AI engineers, machine learning engineers, data scientists. So this is the universal language, I would say, when it comes to programming. So if you're considering getting into machine learning in 2024, then Python will be your friend. So knowing the theory is one thing, then implementing it in the actual job is another. And that's exactly where Python comes in handy. So you need to know Python in order to perform descriptive statistics, in order to train machine learning model or more advanced machine learning models or deep learning models. You can use for training validation and for testing of your models, and also for building different sorts of applications. So Python is super powerful. Therefore, it's also gaining such a high popularity across the globe because it has so many libraries. It has TensorFlow, PyTorch, both that are must if you want to not only get into machine learning, but also the advanced levels of machine learning. So if you are considering the AI engineering jobs or machine learning engineering jobs, and you want to train for instance, deep learning models, or you want to build large language models or generative AI models, then you definitely need to learn PyTorch and TensorFlow, which are frameworks that are used in order to implement different deep learning, which are advanced machine learning models. Here are a few libraries that you need to know in order to get into machine learning. So you definitely need to know pandas, NumPy, you need to know scikit-learn, SciPy, you also need to know NLTK for the text data. You also need to know TensorFlow and PyTorch for a bit more advanced machine learning. And besides this, there are also data visualization libraries that I would definitely suggest you to practice with, which are the Matplotlib, and specifically the PyPlot, and also the Seaborn. When it comes to Python, besides knowing how to use libraries, you also need to know some basic data structures. So you need to know what are these variables, how you can create variables, what are the matrices, arrays, how the indexing works, and also what are the lists, what are the sets, so unique lists, what are the ways that you can, what are the different operations you can perform. How does the sorting, for instance, work? I would definitely suggest you know some basic data structures and algorithms such as binary sort, so an optimal way to sort your arrays. You also need to know the data processing in Python, so you need to understand how to identify missing data, how to identify duplicates in your data, how to clean this, how to perform feature engineering, so how to combine multiple variables or to perform operations to create new variables. You also need to know how you can aggregate your data, how you can filter your data, how you can sort your data. And of course, you also need to know how you can run A-B testing in your Python, and how you can train machine learning models, how you can test it, and how you can evaluate them, and also visualize the performance of it. 
If you want to learn Python, then the easiest thing you can do is just to Google for Python for data science, or Python for machine learning tutorials, or blogs, or you can even try out the Python for data science course at learnertech.ai in order to learn all these basics and usage of these libraries and some practical examples when it comes to Python for machine learning. The next skill set that you need to gain in order to get into machine learning is the basic introduction to NLP, natural language processing. So you need to know how to work with text data, given that these days the text data is the cornerstone of all these different advanced algorithms such as GPTs, transformers, the attention mechanisms, so those applications that you see as part of building chatbots or this personalized applications based on text data, they are all based on NLP. So therefore you need to know these basics of NLP to just get started with machine learning. So you need to know this idea of text data, what are those strings, how you can clean text data, so how you can clean those dirty data that you get, and what are the steps involved such as lowercasing, removing punctuation, tokenization, also what is this idea of stemming, lemmatization, stopwords, how you can use the NLTK in Python in order to perform this cleaning. You also need to know this idea of embeddings and you can also learn this idea of the TF-IDF which is a basic NLP algorithm. You also can learn this idea of word embeddings, the subword embeddings, and the character embeddings. If you want to learn the basics of NLP you can check out those concepts and learn them as part of the blogs. There are many tutorials on YouTube. You can also try the introduction to NLP course at ludotech.ai in order to learn these different basics that form the NLP. If you want to go beyond this intro till medium level machine learning and you also want to learn a bit more advanced machine learning and this is something that you need to know after you have gained all these previous skills that I mentioned, then you can gain this knowledge and the skill set by learning deep learning. And also you can consider getting into generative AI topics. So you can for instance learn what are the RNNs, what are the ANNs, what are the CNNs. You can learn what is this autoencoder concept, what are the variational autoencoders, what are the generative adversarial networks, so GENs. You can understand what is this idea of reconstruction error. You can understand these different sorts of neural networks, what is this idea of backpropagation, the optimization of these algorithms by using these different optimization algorithms such as GD, HGD, HGD momentum, ADAM, ADAMW, RMSProp. You can also go one step beyond and you can get into generative AI topics such as the variational autoencoders like I just mentioned but also the large language models. So if you want to move towards the NLP side of generative AI and you want to know how the chat GPT has been invented, how the GPTs work or the BERT model, then you will definitely need to get into this topic of language models. So what are the N-grams, what is the attention mechanism, what is the difference between the self-attention and attention, what is one head self-attention mechanism, what is multi-head self-attention mechanism. You also need to know at high level this encoder-decoder architecture of transformers. 
So you need to know the architecture of transformers and how they solve different problems of recurrent neural networks or RNNs and LSTMs. You can also look into this encoder-based or decoder-based algorithm such as GPTs or BERT models. and those all will help you to not only get into machine learning but also stand out from all the other candidates by having this advanced knowledge. Let's now talk about different sorts of projects that you can complete in order to train your machine learning skill set that you just learned. So there are a few projects that I suggest you to complete and you can put this on your resume to start to apply for machine learning roles. The first application in the project that I would suggest you to do is building a basic recommender system whether it's a job recommender system or movie recommender system. In this way you can showcase how you can use for instance text data from those job advertisement, how you can use numeric data such as the ratings of the movies in order to build a top-end recommender system. This will showcase your understanding of the distance measures such as cosine similarity, this KNN algorithm idea and this will help you to tackle this specific area of data science and machine learning. The next project I would suggest you to do will be to build a regression based model. So in this way you will showcase that you understand this idea of regression, how to work with predictive analytics and predictive model that has a dependent variable, a response variable that is in the numeric format. So here for instance you can estimate the salaries of the jobs based on the characteristics of the job, based on this data which you can get for instance from open source web pages such as Kaggle and you can then use different source of regression algorithms to perform your predictions of the salaries, evaluate the model and then compare the performance of these different machine learning regression based algorithms. For instance you can use the linear regression, you can use the decision trees regression version, you can use the random forest, you can use GBM, XGBoost in order to showcase and then in one graph to compare this performance of these different algorithms by using a single regression ML model metrics. So for instance the RMSE. This project will showcase that you understand how you can train a regression model, how you can test it and validate it and it will showcase your understanding of optimization of this regression algorithm, you understand this concept of hyper parameter tuning. The next project that I would suggest you to do in order to showcase your classification knowledge so when it comes to predicting a class for an observation given the feature space would be to build a classification model that would classify emails being a spam or not a spam. So you can use a publicly available data that will be describing a specific email and then you will have multiple emails and the idea is to build a machine learning model that would classify the email to the class 0 and class 1 where class 0 for instance can be your not being a spam and 1 being a spam. 
So with this binary classification you will showcase that you know how to train a machine learning model for classification purposes and you can here use for instance logistic regression, you can use also the decision trees for classification case, you can also use random forest, the eggshells for classification, GBM for classification and with all these models you can then obtain the performance metrics such as F1 score or you can plot the rope curve or the area under the curve metrics and you can also compare those different classification. classification models. So in this way, you will also tackle another area of expertise when it comes to the machine learning. Then the final project that I would suggest you to do would be from the unsupervised learning to showcase another area of expertise. And here you can for instance use data to your customers into good, better and best customers based on their transaction history, the amount of money that they are spending in the store. So in this case, you can for instance use k-means, DBSCAN, hierarchical clustering and then you can evaluate your clustering algorithms and then select the one that performs the best. So you will then in this case cover yet another area of machine learning which would be super important to showcase that you can not only handle recommender systems or supervised learning but also unsupervised learning. And the reason why I suggest you to cover all these different areas and complete these four different projects is because in this way you will be covering different expertise and areas of machine learning. So you will be also putting projects on your resume that are covering different sorts of algorithms, different sorts of matrix and approaches and it will showcase that you actually know a lot from machine learning. Now if you want to go beyond the basic or medium level and you want to be considered for medium or advanced machine learning levels and positions, you also need to know bit more advanced which means that you need to complete bit more advanced projects. For instance, if you want to apply for generative AI related or large language models related positions, I would suggest you to complete a project where you are building a very basic large language model and specifically the pre-training process, which is the most difficult one. So in this case, for instance, you can build a baby GPT and I'll put a here link that you can follow where I'm building a baby GPT, a basic pre-trained GPT algorithm where I am using a text data, publicly available data, in order to process data in the same way like GPT is doing and the encoder part of the transformers. In this way, you will showcase to your hiring managers that you understand this architecture behind transformers, architecture behind the large language models and the GPTs and you understand how you can use PyTorch in Python in order to do this advanced NLP and generative AI task. And finally, let's now talk about the common career path and the business titles that you can expect from a career in machine learning. So assuming that you have gained all the skills that are must for breaking into machine learning, there are different sorts of business titles that you can apply in order to get into machine learning. So when it comes to machine learning, you can get into machine learning and there are different fields that are covered as part of this. So first, we have the general machine learning researcher. 
Machine learning researcher is basically doing a research, so training, testing, evaluating different machine learning algorithms. There are usually people who come from academic background, but it doesn't mean that you cannot get into machine learning research without getting a degree in statistics, mathematics, or in machine learning specifically. Not at all. So if you have this desire and this passion for reading, doing research, and you don't mind reading research papers, then machine learning researcher job would be a good fit for you. So machine learning combined with research, then sets you for the machine learning researcher role. Then we have the machine learning engineer. So machine learning engineer is the engineering version of the machine learning expertise, which means that we are combining machine learning skills with the engineering skills, such as productionization, building pipelines, so end-to-end robust pipeline skill, ability of the model, considering all these different aspects of the model, not only from the performance side when it comes to the quality of the algorithm, but also the scalability of it and when putting it in front of many users. So when it comes to combining engineering with machine learning, then you get machine learning engineering. So if you're someone who is a software engineer and you want to get into machine learning, then machine learning engineering would be the best fit for you. So for machine learning engineering, you not only need to have all these different skills that I already mentioned, but you also need to have this good grasp of scalability of algorithms, the data structures and algorithms type of skill set, the complexity of the model, also system design. So this one converges more towards and similar to the software engineering position combined with machine learning, rather than your pure machine learning or AI role. Then we have the AI research versus AI engineering position. So the AI research position is similar to the machine learning research position. And the AI engineer position is similar to the machine learning engineer position with only single difference. When it comes to machine learning, we are specifically talking about this traditional machine learning. So linear regression, logistic regression, and also random forest, accuracy boost, bagging. And when it comes to AI research and AI engineer position, here we are tackling more the advanced machine learning. So here we are talking about deep learning models such as RNNs, LSTMs, GRUs, CNNs, so computer vision applications. And we are also talking about generative AI models, large language models. So we are talking about the transformers, the implementation of transformers, the GPTs, T5, all these different algorithms that are from more advanced AI topics rather than traditional machine learning. For those, you will then be applying for AI research and AI engineering positions. And finally, you'll have this different source of approvations, niches from AI. For instance, NLP research, NLP engineer, or even data science positions for which you will need to know machine learning and knowing machine learning will set you apart for this source of positions. So also the business titles such as data science or technical data science positions, NLP researcher, NLP engineer, for this all, you will need to know machine learning. And knowing machine learning will help you to break into those positions and those career paths. 
If you want to prepare for your deep learning interviews, for instance, and you want to get into AI engineering or AI research, then I have recently published for free a full course with a hundred interview questions with answers for a span of 7.5 hours that will help you to prepare for your deep learning interviews. And for your machine learning interviews, you can check out my Fundamentals to Machine Learning course at lunatech.ai, or you can download the Machine Learning Fundamentals Handbook from FreeCodeCamp and check out my blogs and also free resources at lunatech.ai in order to prepare for your interviews and in order to get into machine learning. Let's now talk about the list of resources that you can use in order to get. into machine learning in 2024. So to learn statistics and the fundamental concepts of statistics, you can check out the Fundamentals to Statistics course at lunatech.ai. Here you can learn all these required concepts and topics and you can practice it in order to get into machine learning and to gain these statistical skills. Then when you want to learn machine learning, you can check the Fundamentals to Machine Learning course at lunatech.ai to get all these basic concepts, the Fundamentals to Machine Learning, and the list of comprehensive and the most comprehensive list of machine learning algorithms out there as part of this course. Then you can also check out the Introduction to NLP course at lunatech.ai in order to learn the basic concepts behind natural language preprocessing. And finally, if you want to learn Python and specifically Python for machine learning, you can check out the Python for Data Science course at lunatech.ai. And if you want to get access to these different projects that you can practice your machine learning skills that you just learned, you can either check out the Ultimate Data Science Bootcamp that covers a specific course, the Data Science Project Portfolio course, covering multiple of these projects that you can train your machine learning skills and put on your resume. Or you can also check my GitHub account or my LinkedIn account where I cover many case studies including the Baby TPT. And I will also put the link to this course and to this case study in the link below. And once you have gained all the skills, you are ready to get into machine learning in 2024. In this lecture, we will go through the basic concepts in machine learning that is needed to understand and follow conversations and solve main problems using machine learning. Strong understanding of machine learning basics is an important step for anyone looking to learn more about or work with machine learning. We will be looking at the three concepts in this tutorial. We will define and look into the difference between supervised and unsupervised machine learning models. Then we will look into the difference between the regression and classification type of machine learning models. After this, we will look into the process of training machine learning models from scratch and how to evaluate them by introducing performance metrics that you can use depending on the type of machine learning model or problem you are dealing with. So whether it's a supervised or unsupervised, whether it's regression versus classification type of problem. Machine learning methods are categorized into two types depending on the existence of the labeled data in the training data set, which is especially important in the training process. 
So we are talking about the so-called dependent variable that we saw in the section of fundamentals to statistics. Supervised and unsupervised machine learning models are two main type of machine learning algorithms. One key difference between the two is the level of supervision during the training phase. Supervised machine learning algorithms are guided by the labeled examples, while unsupervised algorithms are not. This learning model is more reliable, but it also requires a larger amount of labeled data, which can be time-consuming and quite expensive to obtain. Examples of supervised machine learning models include regression and classification type of models. On the other hand, unsupervised machine learning algorithms are trained on unlabeled data. The model must find patterns and relationships in the data without the guidance of correct outputs, so we no longer have a dependent variable. So unsupervised ML models require training data that consists only of independent variables or features, and there is no dependent variable for labeled data that can supervise. the algorithm while learning from the data. Examples of unsupervised models are clustering models and outlier detection techniques. Supervised machine learning methods are categorized into two types depending on the type of dependent variable they are predicting. So we have regression type and we have classification type. Some key differences between regression and classification include output type, so the regression algorithms predict continuous values while the classification algorithms predict categorized values. Some key differences between regression and classification include the output type, the evaluation metrics, and their applications. So with regards to the output type, regression algorithms predict continuous values while classification algorithms predict categorical values. With regards to the evaluation metric, different evaluation metrics are being used for regression and classification tasks. For example, mean squared error is commonly used to evaluate regression models while accuracy is commonly used to evaluate classification models. When it comes to applications, regression and classification models are used in entirely different types of applications. Regression models are often used for prediction tasks while classifications are used for decision-making tasks. Regression algorithms are used to predict the continuous value such as price or probability. For example, a regression model might be used to predict the price of a house based on its size, location, or other features. Examples of regression type of machine learning models are linear regression, fixed effect regression, exhibit regression, etc. Classification algorithms on the other hand are used to predict the categorical value. These algorithms take an input and classify it to one of the several predetermined categories. For example, a classification model might be used to classify emails as a spam or as not a spam or to identify the type of animal in an image. Examples of classification type of machine learning models are logistic regression, hgboost classification, random forest classification. Let us now look into different type of performance metrics we can use in order to evaluate different type of machine learning models. For regression models, common evaluation metrics include residual sum of squared, which is the RSS, mean squared error, which is the MSE, the root mean squared error or RMSE, and the mean absolute error, which is the MAE. 
These metrics measure the difference between the predicted values and the true values with a lower value indicating a better fit for the model. So let's go through these metrics one by one. The first one is the RSS or the residual sum of squares. This is a metric commonly used in the setting of linear regression when we are evaluating the performance of the model in estimating the different coefficients. And here the beta is a coefficient and the yi is our dependent variable value and the y hat is the predicted value. As you can see, the RSS or the residual sum of square or the beta is equal to sum of all the squared of yi minus y hat across all i is equal to 1 till n, where i is the index of the each row or the individual or the observation included in the data. The second metrics is the MSE or the mean squared error, which is the average of the squared differences between the predicted values and the true values. So as you can see, MSE is equal to 1 divided to n and then sum across all i yi minus y hat squared. As you can see, the RSS and the MSE are quite similar in terms of their formulas. The only difference is that we are adding a 1 divided to n and then this makes it the average across all the squared differences between the predicted values and the true values. differences between the predicted value and the actual true value. A lower value of MSE indicates a better fit. The RMSE, which is the Root Mean Squared Error, is the square root of the MSE. So as you can see, it has the same formula as MSE, only with the difference that we are adding a square root on the top of this formula. A lower value of RMSE indicates a better fit. And finally, the MAE, or the Mean Absolute Error, is the average absolute difference between the predicted values, so the y-hat, and the true values, or yi. A lower value of this indicates a better fit. The choice of regression metrics depends on the specific problem you are trying to solve, and the nature of your data. For instance, the MSE is commonly used when you want to penalize large errors more than the small ones. MSE is sensitive to outliers, which means that it may not be the best choice when your data contains many outliers, or extreme values. RMSE, on the other hand, which is the square root of the MSE, makes it easier to interpret. So it's easier interpretable, because it's in the same units as the target variable. It is commonly used when you want to compare the performance of different models, or when you want to report the error in a way that it's easier to understand and to explain. The MIA is commonly used when you want to penalize all errors equally, regardless of their magnitude. And MIE is less sensitive to outliers compared to MSE. For classification models, common evaluation metrics include accuracy, precision, recall, and F1 score. These metrics measure the ability of the machine learning model to correctly classify instances into the correct categories. Let's briefly look into these metrics individually. So the accuracy is a proportion of correct predictions made by the model. It's calculated by taking the correct predictions, so the correct number of predictions, and divide to all number of predictions, which means correct predictions plus incorrect predictions. Next, we will look into the precision. So precision is a proportion of true positive predictions among all positive predictions made by the model. And it's equal to true positive divided to true positive plus false positive, so all number of positives. 
True positives are cases where the model correctly predicts a positive outcome, while false positives are the cases where the model incorrectly predicts a positive outcome. Next metric is recall. Recall is a proportion of true positive predictions among all actual positive instances. It's calculated as the number of true positive predictions divided by the total number of actual positive instances, which means dividing the true positive to true positive plus false negative. So, for example, let's say we are looking into a medical test. A true positive would be a case where the test correctly identifies a patient as having a disease, while a false positive would be a case where the test incorrectly identifies a healthy patient as having the disease. And the final score is the F1 score. The F1 score is the harmonic mean or the usual mean of the precision and recall, with the higher value indicating a better balance between precision and recall. And it's calculated as the two times, recall times precision divided to recall plus precision. For unsupervised models, such as class string models, whose performance is typically evaluated using metrics that measure the similarity of the data points within a cluster and the dissimilarity of the data points between different clusters, we have three types of metrics that we can use. Homogeneity is a measure of the degree to which all of the data points within a single cluster belong to the same class. A higher value indicates a more homogeneous cluster. So as you can see, homogeneity of H, where H is the simply the short way of describing homogeneity, is equal to 1 minus conditional entropy given cluster assignments divided to the entropy of predicted class. If you are wondering what this entropy is, then stay tuned as we are going to discuss this entropy whenever we will discuss the clustering as well as decision trees. The next matrix is the silhouette score. Silhouette score is a measure of the similarity of the data point to its own cluster compared to the other clusters. A higher silhouette score indicates that the data point is well matched to its own cluster. This is usually used for db-scan or k-means. So here, the silhouette score can be represented by this formula. So the SO or the silhouette score is equal to BO minus AO divided to the maximum of AO and BO. Where SO is the silhouette coefficient of the data point characterized by O. AO is the average distance between O and all the other data points in the cluster to which O belongs. And the BO is the minimum average distance from O to all the clusters to which O does not belong. The final matrix we will look into is the completeness. Completeness is another measure of the degree to which all of the data points that belong to a particular class are assigned to the same cluster. A higher value indicates a more complete cluster. Let's conclude this lecture by going through the step-by-step process of evaluating a machine learning model at a very simplified version. Since there are many additional considerations and techniques that may be needed depending on a specific task and the characteristics of the data. Knowing how to properly train a machine learning model is really important since this defines the accuracy of the results and conclusions you will make. The training process starts with the preparing of the data. 
This includes splitting the data into training and test sets, or if you are using more advanced resampling techniques that we will talk about later, then splitting your data into multiple sets. The training set of your data is used to feed the model. If you have also a validation set, then this validation set is used to optimize your hyperparameters and to pick the best model while the test set is used to evaluate the model performance. When we will approach more lectures in this section, we will talk in detail about these different techniques, as well as what the training means, what the test means, what validation means, as well as what the hyperparameter tuning means. Secondly, we need to choose an algorithm or set of algorithms and train the model on the training data and save the fitted model. There are many different algorithms to choose from, and the appropriate algorithm will depend on the specific task and the characteristics of the data. As a third step, we need to adjust the model parameters to minimize the error on the training set by performing hyperparameter tuning. For this, we need to use validation data, and then we can select the best model that results in the least possible validation error rate. In this step, we want to look for the optimal set of parameters that are included as part of our model to end up with the model that has the least possible error, so it performs in the best possible way. In the final two steps, we need to evaluate the model. We are always interested in a test error rate and not the training or the validation error rates, because we have not used the test set, but we have used the training and validation sets. So this test error rate will give you an idea of how well the model will generalize to the new unseen data. We need to use the optimal set of parameters from hyperparameter tuning stage and the training data to train the model again with these hyperparameters and with the best model. So we can use the best fitted model to get the best result. the predictions on the test data. And this will help us to calculate our test error rate. Once we have calculated the test error rate and we have also obtained our best model, we are ready to save the predictions. So once we are satisfied with the model performance and we have tuned the parameters, we can use it to make predictions on a new unseen data on the test data and compute the performance metrics for the model using the predictions and the real values of the target variable from the test data. And this completes this lecture. So in this lecture, we have spoken about the basics of machine learning. We have discussed the difference between the unsupervised and supervised learning models, as well as repression versus classification. We have discussed in details the different type of performance metrics we can use to evaluate different type of machine learning models, as well as we have looked into the simplified version of the step-by-step process to train a machine learning model. In this lecture, lecture number two, we will discuss a very important concept which you need to know before considering and applying any statistical or machine learning model. Here, I'm talking about the bias of the model and the variance of the model and the trade-off between the two, which we call bias-variance trade-off. Whenever you are using a statistical, econometrical or machine learning model, no matter how simple the model is, you should always evaluate your model and check its error rate. 
In all these cases, it comes down to the trade-off you make between the variance of the model and the bias of your model, because there is always a catch when it comes to the model choice and the performance. Let us firstly define what bias and the variance of the machine learning model are. The inability of the model to capture the true relationship in the data is called bias. Hence, the machine learning models that are able to detect the true relationship in the data have low bias. Usually, complex models or more flexible models tend to have a lower bias than simpler models. So mathematically, the bias of the model can be expressed as the expectation of the difference between the estimate and the true value. Let us also define the variance of the model. The variance of the model is the inconsistency level or the variability of the model performance when applying the model to different datasets. When the same model that is trained using training data performs entirely differently than on the test data, this means that there is a large variation or variance in the model. Complex models or more flexible models tend to have a higher variance than simpler models. In order to evaluate the performance of the model, we need to look at the amount of error that the model is making. For simplicity, let's assume we have the following simple regression model, which aims to use a single independent variable X to model the numeric Y dependent variable. That is, we fit our model on our training observations where we have a pair of independent and dependent variables, X1, Y1, X2, Y2, up to Xn, Yn. And we obtain an estimate for our training observations, F hat. We can then compute this, let's say F hat X1, F hat X2, up to F hat Xn, which are the estimates for our dependent variable Y1, Y2, up to Yn. And if these are approximately equal to this actual values, so one hat is approximately equal to Y1, Y2 hat is approximately equal to Y2 hat, et cetera, then the training error rate would be small. However, if we are really interested in whether our model is predicting the dependent variable appropriately, we want to, instead of looking at the training error rate, we want to look at our test error rate. So the error rate of the model is the expected square difference between the real test values and their predictions, where the predictions are made using the machine learning model. We can rewrite this error rate as a. of two quantities whereas you can see the left part is the amount of fx minus f hat x squared and the second entity is the variance of the error term. So the accuracy of y hat as a prediction for y depends on the two quantities which we can call the reducible error and the irreducible error. So this is the reducible error equal to fx minus f hat x squared and then we have our irreducible error or the variance of epsilon. So the accuracy of y hat as a prediction for y depends on the two quantities which we can call the reducible error and the irreducible error. In general the f hat will not be a perfect estimate for f and this inaccuracy will introduce some errors. This error is reducible since we can potentially improve the accuracy of f hat by using the most appropriate machine learning model and the best version of it to estimate the f. However even if it was possible to find a model that would estimate f perfectly so that the estimated response took the form of y hat is equal to fx, our prediction would still have some error in it. 
This happens because y is also a function of the error term epsilon, which by definition cannot be predicted using our feature x. So there will always be some error that is not predictable. The variability associated with the error epsilon also affects the accuracy of the predictions, and this is known as the irreducible error, because no matter how well we estimate f, we cannot reduce the error introduced by epsilon. This error contains all the features that are not included in our model, so all the unknown factors that have an influence on our dependent variable but are not included as part of our data. But we can reduce the reducible error rate, which is based on two values: the variance of the estimate and the bias of the model. If we simplify the mathematical expression describing the error rate, it is equal to the variance of our model, plus the squared bias of our model, plus the irreducible error. So even if we cannot reduce the irreducible error, we can reduce the reducible error rate, which is based on these two values, the variance and the squared bias. Though the mathematical derivation is out of the scope of this course, just keep in mind that the reducible error of the model can be described as the sum of the variance of the model and the squared bias of the model. So, mathematically, the error of a supervised machine learning model is equal to the squared bias of the model, plus the variance of the model, plus the irreducible error. Therefore, in order to minimize the expected test error rate, so the error on unseen data, we need to select the machine learning method that simultaneously achieves low variance and low bias. And that's exactly what we call the bias-variance trade-off. The problem is that there is a negative correlation between the variance and the bias of the model. Another thing that is highly related to the bias and the variance of the model is the flexibility of the machine learning model. The flexibility of the machine learning model has a direct impact on its variance and on its bias. Let's look at these relationships one by one. Complex or more flexible models tend to have a lower bias, but at the same time they tend to have a higher variance than simpler models. So as the flexibility of the model increases, the model finds the true patterns in the data more easily, which reduces the bias of the model; at the same time, the variance of such a model increases. And as the flexibility of the model decreases, the model finds it more difficult to find the true patterns in the data, which increases the bias of the model but also decreases its variance. Keep this topic in mind, and we will continue it in the next lecture, where we will discuss the topic of overfitting and how to solve the overfitting problem by using regularization. In this lecture, lecture number three, we will talk about a very important concept called overfitting and how we can solve it by using different techniques, including regularization. This topic is related to the previous lecture and to the topics of the error of the model, the training error rate, the test error rate, and the bias and variance of the machine learning model. It is important to know what overfitting is and how to solve it with regularization, because overfitting can lead to inaccurate predictions and a lack of generalization of the model to new data. Knowing how to detect and prevent overfitting is crucial in building effective machine learning models.
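Before going further, here is a small simulation sketch (an illustrative setup of my own, not from the course) that makes the flexibility discussion concrete: it fits polynomials of increasing degree to noisy data and compares training and test error. As the degree, so the flexibility, grows, the training error keeps falling while the test error eventually rises, which is exactly the overfitting behaviour discussed next.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # true f plus irreducible noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in [1, 3, 10, 20]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # Low degree: high bias. Very high degree: high variance, so test error grows.
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```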
Questions about this topic are almost guaranteed to appear during every single data science interview. In the previous lecture, we discussed the relationship between model flexibility and the variance as well as the bias of the model. We saw that as the flexibility of the model increases, the model finds the true patterns in the data more easily, which reduces the bias of the model, but at the same time the variance of such models increases. And as the flexibility of the model decreases, the model finds it more difficult to find the true patterns in the data, which increases the bias of the model and decreases its variance. Let's first formally define what the overfitting problem is, as well as what underfitting is. Overfitting occurs when the model performs well on the training data while it performs worse on the test data. So you end up having a low training error rate but a high test error rate. In an ideal world, we want our test error rate to be low, or at least that the training error rate is roughly equal to the test error rate. Overfitting is a common problem in machine learning, where a model learns the detail and the noise in the training data to the point where it negatively impacts the performance of the model on new data. So the model follows the data too closely, closer than it should. This means that the noise, or the random fluctuations in the training data, is picked up and learned as concepts by the model, which it should actually ignore. The problem is that the noise, or random component, of the training data will be very different from the noise in new data. The model will therefore be less effective in making predictions on new data. Overfitting is caused by having too many features, too complex a model, or too little data. When the model is overfitting, the model also has high variance and low bias. Usually, the higher the model flexibility, the higher the risk of overfitting, because then we have a higher risk of the model following the data, including the noise, too closely. Underfitting is the other way around: the model is too simple to capture the underlying patterns in the data, so even the training error rate ends up being high. Given that overfitting is a much bigger problem, and we want, ideally, to fix the case when our test error rate is large, we will focus only on overfitting. This is also the topic that you can expect during your data science interviews, as well as something that you need to be aware of whenever you are training a machine learning model. All right, so now that we know what overfitting is, let's talk about how we can fix this problem. There are several ways of fixing or preventing overfitting. First, you can reduce the complexity of the model. We saw that the higher the complexity of the model, the higher the chance of following the data, including the noise, too closely, resulting in overfitting. Therefore, reducing the flexibility of the model will reduce the overfitting as well. This can be done by using a simpler model with fewer parameters, or by applying a regularization technique, such as L1 or L2 regularization, which we will talk about in a bit. The second solution is to collect more data. The more data you have, the less likely your model will overfit. Third, and another solution, is using resampling techniques, one of which is cross-validation. This is a technique that allows you to train and test your model on different subsets of your data, which can help you to identify whether your model is overfitting.
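As a quick, minimal illustration of the cross-validation idea (the synthetic data and the decision tree model here are just placeholders of my own choosing), a large gap between the training score and the cross-validated score is a typical sign of overfitting:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data and a deliberately flexible model (an unpruned decision tree).
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=1)
model = DecisionTreeRegressor(random_state=1)

# Training R^2 will be (near) perfect, while the cross-validated R^2 is much lower:
# a large gap like this suggests the model is following the training data too closely.
model.fit(X, y)
print("Training R^2:", round(model.score(X, y), 3))
cv_scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print("Cross-validated R^2:", np.round(cv_scores, 3), "mean:", round(cv_scores.mean(), 3))
```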
We will discuss cross-validation, as well as other resampling techniques, later in this section. Another solution is to apply early stopping. Early stopping is a technique where you monitor the performance of the model on a validation set during the training process, and stop the training when the performance starts to decrease. Another solution is to use ensemble methods. By combining multiple models, such as decision trees, overfitting can be reduced. We will be covering many popular ensemble techniques in this course as well. Finally, you can use what we call dropout. Dropout is a regularization technique for reducing overfitting in neural networks by dropping out, or setting to zero, some of the neurons during the training process. From time to time, dropout-related questions do appear during data science interviews, even for people with no experience. So if someone asks you about dropout, at least you will remember that it's a technique used to solve overfitting in the setting of deep learning. It's worth noting that there is no one solution that works for all types of overfitting, and often a combination of the techniques we just talked about should be used to address the problem. We saw that when the model is overfitting, the model has high variance and low bias. By definition, regularization, or what we also call shrinkage, is a method that shrinks some of the estimated coefficients towards zero, to penalize unimportant variables that increase the variance of the model. This is a technique used to solve the overfitting problem by introducing a little bias into the model while significantly decreasing its variance. There are three types of regularization techniques that are widely known in the industry. The first one is ridge regression, or L2 regularization. The second one is Lasso regression, or L1 regularization. And finally, the third one is dropout, which is a regularization technique used in deep learning. We will cover the first two types in this lecture. Let's now talk about ridge regression, or L2 regularization. Ridge regression, or L2 regularization, is a shrinkage technique that aims to solve overfitting by shrinking some of the model coefficients towards zero. Ridge regression introduces a little bias into the model while significantly reducing the model variance. Ridge regression is a variation of linear regression, but instead of only minimizing the sum of squared residuals, as linear regression does, it aims to minimize the sum of squared residuals plus the sum of the squared coefficients, what we call the L2 regularization term. Let's look at a multiple linear regression example with p independent variables, or predictors, that are used to model the dependent variable y. If you have followed the statistical section of this course, you might also recall that the most popular estimation technique to estimate the parameters of the linear regression, assuming its assumptions are satisfied, is ordinary least squares, or OLS, which finds the optimal coefficients by minimizing the sum of squared residuals, or the RSS. So ridge regression is pretty similar to OLS, except that the coefficients are estimated by minimizing a slightly different cost, or loss, function. This is the loss function of ridge regression, where beta j is the coefficient of the model for variable j, beta zero is the intercept, and x ij is the input value for variable j and observation i.
y i is the target variable, or the dependent variable, for observation i, n is the number of samples, and lambda is what we call the regularization parameter of the ridge regression. So this is the loss function of OLS that you can see here, with an added penalization term. The left part is what we call the RSS: if you check out the very first lecture in this section, where we spoke about the different metrics that can be used to evaluate regression-type models, you can see the definition of the RSS, and if you compare the expressions, you can easily see that this is the exact formula for the RSS with an intercept included. The right term is what we call the penalty amount, which is lambda times the sum of the squared coefficients included in our model. Here lambda, which is always non-negative, so larger than or equal to 0, is a tuning parameter, or the penalty parameter. This expression, the sum of the squared coefficients, is called the L2 norm, which is why we call this L2 penalty-based regression, or L2 regularization. In this way, ridge regression assigns a penalty to some variables by shrinking their coefficients towards 0, reducing the overall model variance. But these coefficients will never become exactly 0. So the model parameters are never set to exactly 0, which means that all p predictors of the model are still intact. This is a key property of ridge regression to keep in mind: it shrinks the parameters towards 0, but never sets them exactly equal to 0. The L2 norm is a mathematical term coming from linear algebra, and it stands for the Euclidean norm. We spoke about the penalty parameter lambda, what we also call the tuning parameter lambda, which serves to control the relative impact of the penalty on the regression coefficient estimates. When lambda is equal to 0, the penalty term has no effect, and ridge regression will produce the ordinary least squares estimates. But as lambda increases, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach 0. What is important to keep in mind, which you can also see from this graph, is that in ridge regression a large lambda will assign a penalty to some variables by shrinking their coefficients towards 0, but they will never become exactly 0, which becomes a problem when you are dealing with a model that has a large number of features, and your model has low interpretability. Ridge regression's advantage over ordinary least squares comes from the earlier introduced bias-variance trade-off phenomenon. As lambda, the penalty parameter, increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. The main advantages of ridge regression are the following: it solves overfitting; it can shrink the regression coefficients of less important predictors towards 0; it can improve the prediction accuracy by reducing the variance and increasing the bias of the model; it is less sensitive to outliers in the data compared to linear regression; and it is computationally less expensive compared to LASSO regression. The main disadvantage of ridge regression is the low model interpretability when p, the number of features in your model, is large. Let's now look into another regularization technique called LASSO regression, or L1 regularization.
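Before that, here is a minimal ridge regression sketch, assuming scikit-learn (where the penalty parameter lambda is exposed as `alpha`); the synthetic data is just meant to show how larger penalties shrink the coefficients towards zero without ever setting them exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic regression data where only a few features are truly informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # standardize so the penalty treats features equally

for alpha in [0.1, 1, 10, 100]:  # alpha plays the role of lambda in the lecture
    ridge = Ridge(alpha=alpha)
    ridge.fit(X, y)
    # Coefficients shrink as alpha grows, but none of them become exactly zero.
    print(f"alpha={alpha:5}:", np.round(ridge.coef_, 2))
```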
By definition, LASSO regression, or L1 regularization, is a shrinkage technique that aims to solve overfitting by shrinking some of the model coefficients towards 0 and setting some of them to exactly 0. LASSO regression, like ridge regression, introduces a little bias into the model while significantly reducing the model variance. There is, however, a small difference between the two regression techniques that makes a huge difference in their results. We saw that one of the biggest disadvantages of ridge regression is that it will always keep all p predictors in the final model, whereas LASSO overcomes this disadvantage. So a large lambda, or penalty parameter, will assign a penalty to some variables by shrinking their coefficients towards zero, but in the case of ridge regression they will never become exactly zero, which becomes a problem when your model has a large number of features and low interpretability. LASSO regression overcomes this disadvantage of ridge regression. Let's have a look at the loss function of LASSO regularization. This is the loss function of OLS, which is the left part of the formula, the RSS, combined with a penalty amount, which is the right-hand side of the expression: lambda times the sum of the absolute values of the coefficients beta j. As you can see, the RSS part is exactly the same as the loss function of OLS. Then we add the second term, which is lambda, the penalization parameter, multiplied by the sum of the absolute values of the coefficients beta j, where j goes from 1 to p, and p is the number of predictors included in our model. Here, once again, lambda, which is always non-negative, so larger than or equal to 0, is a tuning parameter, or the penalty parameter. This expression, the sum of the absolute values of the coefficients, is called the L1 norm, which is why we call this L1 penalty-based regression, or L1 regularization. In this way, LASSO regression assigns a penalty to some of the variables by shrinking their coefficients towards 0 and setting some of these parameters to exactly 0. So some of the coefficients will end up being exactly equal to 0, which is the key difference between LASSO regression and ridge regression. The L1 norm is a mathematical term coming from linear algebra, and it stands for the Manhattan norm, or Manhattan distance. You might see a key difference when comparing the visual representation of LASSO regression to the visual representation of ridge regression. If you look at this point, you can see that there will be cases where our coefficients will be set to exactly 0; this is where we have the intersection. Whereas in the case of ridge regression, you can recall that there was not a single intersection: there were points where the circle came close to the intersection points, but there was never a point where there was an intersection and the coefficients were set to 0. And that's the key difference between these two regularization techniques. The main advantages of LASSO regression are the following: it solves overfitting, as LASSO regression can shrink the regression coefficients of less important predictors towards 0, and some of them to exactly 0. As the model filters some variables out, LASSO indirectly also performs what we call feature selection, such that the resulting model has fewer features and is much more interpretable compared to ridge regression.
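Here is a companion sketch (same synthetic setup as the ridge example above, again purely illustrative) showing the key difference just stated: with a large enough penalty, Lasso sets some coefficients exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 2))  # shrunk, but all non-zero
print("Lasso coefficients:", np.round(lasso.coef_, 2))  # some are exactly 0.0
print("Features kept by Lasso:", int(np.sum(lasso.coef_ != 0)), "out of", X.shape[1])
```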
LASSO can also improve the prediction accuracy of the model by reducing the variance and increasing the bias of the model, though not as much as ridge regression. Earlier, when speaking about correlation, we also briefly discussed the concept of causation. We discussed that correlation is not causation, and we also briefly spoke about the method used to determine whether there is causation or not. That model is the well-known linear regression. Even if this model is recognized as a simple approach, it's one of the few methods that allows identifying features that have an impact, or a statistically significant impact, on a variable that we are studying, are interested in, and want to explain. It also helps you identify how, and by how much, the target variable changes when the independent variable values change. To understand the concept of linear regression, you should also know and understand the concepts of dependent variable, independent variable, linearity, and statistically significant effect. Dependent variables are often referred to as response variables or explained variables. By definition, the dependent variable is the variable that is being measured or tested. It's called the dependent variable because it's thought to depend on the independent variables. So you can have one or multiple independent variables, but you can have only one dependent variable, which is your target variable that you are interested in. Let's now look into the definition of the independent variable. Independent variables are often referred to as regressors or explanatory variables. By definition, an independent variable is a variable that is being manipulated or controlled in the experiment and is believed to have an effect on the dependent variable. Put differently, the value of the dependent variable is thought to depend on the value of the independent variable. For example, in an experiment to test the effect of having a degree on the wage, the degree variable would be your independent variable, and the wage would be your dependent variable. Finally, let's look into the very important concept of statistical significance. We call an effect statistically significant if it's unlikely to have occurred by random chance. In other words, a statistically significant effect is one that is likely to be real and not due to random chance. Let's now define the linear regression model formally, and then we will dive deep into the theoretical and practical details. By definition, linear regression is a statistical or machine learning method that can help to model the impact of a unit change in a variable, the independent variable, on the values of another target variable, the dependent variable, when the relationship between the two variables is assumed to be linear. When the linear regression model is based on a single independent variable, we call this model simple linear regression; when the model is based on multiple independent variables, we call it multiple linear regression. Let's look at the mathematical expression describing linear regression. You can recall that when the linear regression model is based on a single independent variable, we just call it simple linear regression. The expression that you see here is the most common mathematical expression describing simple linear regression: yi is equal to beta zero plus beta one times xi plus ui. In this expression, yi is the dependent variable.
And the i that you see here is the index corresponding to the ith row. Whenever you get data and want to analyze it, you will have multiple rows, and these rows describe the observations that you have in your data, so they can be people, or any other observations describing your data. The i then refers to the specific row, the ith row, that you have in your data, and yi is the dependent variable's value corresponding to that ith row. The same holds for xi. The xi is the independent variable, or the explanatory variable, or the regressor, that you have in your model, which is the variable that we are testing. We want to manipulate it to see whether this variable has a statistically significant impact on the dependent variable y, so we want to see whether a unit change in x will result in a specific change in y, and what kind of change that is. The beta zero that you see here is not a variable; it's called the intercept, or the constant, and it is unknown, so we don't have it in our data. It's one of the parameters of the linear regression, an unknown number which the linear regression model should estimate. So we want to use the linear regression model to find out this unknown value, as well as the second unknown value, which is beta one, and we can also estimate the error terms, which are represented by ui. Beta one, next to the xi, so next to the independent variable, is also not a variable. Like beta zero, it is an unknown parameter of the linear regression model, an unknown number which the model should estimate. Beta one is often referred to as the slope coefficient of variable x, which is the number that quantifies how much the dependent variable y will change if the independent variable x changes by one unit. And that's exactly what we are most interested in, the beta one, because this is the coefficient and the unknown number that will help us to understand and answer the question of whether our independent variable x has a statistically significant impact on our dependent variable y. Finally, the u that you see here, or the ui in the expression, is the error term, or the amount of mistake that the model makes when explaining the target variable. We add this value since we know that we can never exactly and accurately estimate the target variable. We will always make some amount of estimation error, and we can never estimate the exact value of y. Hence, we need to account for this mistake, which we know in advance we are going to make, by adding an error term to our model. Let's also have a brief look at how multiple linear regression is usually expressed in mathematical terms. You might recall that the difference between simple linear regression and multiple linear regression is that the first one has a single independent variable in it, whereas the latter, the multiple linear regression, like the name suggests, has multiple independent variables in it, so more than one. Knowing this type of expression is critical, since they not only appear a lot in interviews, but in general you will see them in data science blogs, in presentations, in books, and also in papers. Being able to quickly identify them and say, ah, I remember seeing this one, will help you to more easily understand and follow the process and the storyline. So what you see here you can read as: yi is equal to beta 0, plus beta 1 times x1i, plus beta 2 times x2i, plus beta 3 times x3i, plus ui.
This is the most common mathematical expression describing multiple linear regression, in this case with three independent variables. If you were to have more independent variables, you would add them with their corresponding indices and coefficients. In this case, the method will aim to estimate the model parameters, which are beta 0, beta 1, beta 2, and beta 3. Like before, yi is our dependent variable, which is always a single one, so we only have one dependent variable. Then we have beta 0, which is our intercept, or the constant. Then we have our first slope coefficient, which is beta 1, corresponding to our first independent variable x1. Then we have x1i, which stands for the first independent variable, with index one, and the i stands for the index corresponding to the row. So whenever we have multiple linear regression, we always need to specify two indices, and not only one like we had in our simple linear regression. The first index characterizes which independent variable we are referring to, so whether it's independent variable one, two, or three, and then we need to specify which row we are referring to, which is the index i. You might notice that in this case all the row indices are the same, because we are looking at one specific row, and we are representing this row by using its independent variables, the error term, and the dependent variable. Then we add our third term, which is beta two times x2i. So beta two is our third unknown parameter in the model, and the second slope coefficient, corresponding to our second independent variable. And then we have our third independent variable with the corresponding slope coefficient beta three, and, like always, we also add an error term to account for the error that we know we are going to make. So now that we know what linear regression is and how to express it in mathematical terms, you might be asking the next logical question: how do we find those unknown parameters in the model in order to find out how the independent variables impact the dependent variable? Finding these unknown parameters is called estimation in data science. In general, we are interested in finding the values that best approximate the unknown values in our model, and we call this process estimation. One technique used to estimate the linear regression parameters is called OLS, or ordinary least squares. The main idea behind this approach, the OLS, is to find the best fitting straight line, so the regression line, through a set of paired Xs and Ys, so our independent and dependent variable values, by minimizing the sum of squared errors: that is, to minimize the sum of squares of the differences between the observed dependent variable and the values predicted by our model, this linear function of the independent variables; these differences are the residuals. So this is too much information; let's go step by step. In linear regression, as we just saw when we were writing out our simple linear regression, we have this error term. We can never know the actual error term, but what we can do is estimate the value of the error term, which we call the residual.
So we want to minimize the sum of squared residuals, because we don't know the errors. We want to find a line that will best fit our data in such a way that the error that we are making, or the sum of squared errors, is as small as possible. And since we don't know the errors, we can estimate them by looking, each time, at the value predicted by our model and the true value. We can then subtract them from each other and see how well our model is estimating the values that we have, so how well our model is estimating the unknown parameters. So, to minimize the sum of squared differences between the observed dependent variable and its values predicted by the linear function of the independent variables is to minimize the sum of squared residuals. We denote the estimate of parameters and variables by adding a hat on top of the variables or parameters. In this case, you can see that yi hat is equal to beta zero hat plus beta one hat times xi. You can see that we no longer have an error term in this expression. We say that yi hat is the estimated value of yi, beta zero hat is the estimated value of beta zero, and beta one hat is the estimated value of beta one. The xi is still our data, so the values that we have in our data, and therefore it doesn't get a hat, since it does not need to be estimated. What we want to do is estimate our dependent variable, and we want to compare the estimated value that we got using OLS with the actual, real value, so we can calculate our errors, or rather the estimate of the error, which is represented by ui hat. So ui hat is equal to yi minus yi hat, where ui hat is simply the estimate of the error term, or the residual. This predicted error is always referred to as the residual. So make sure that you do not confuse the error with the residual. The error can never be observed: you can never calculate it and you will never know it. But what you can do is predict the error, and when you predict the error, you get a residual. What OLS is trying to do is minimize the amount of error that it's making. Therefore, it looks at the sum of squared residuals across all the observations, and it tries to find the line that minimizes this value. That is why we say that OLS tries to find the best fitting straight line, such that it minimizes the sum of squared residuals. We have discussed this model mainly from the perspective of causal analysis, in order to identify features that have a statistically significant impact on the response variable. But linear regression can also be used as a prediction model for modeling linear relationships. So let's refresh our memory with the definition of the linear regression model. By definition, linear regression is a statistical or machine learning method that can help to model the impact of a unit change in a variable, the independent variable, on the values of another target variable, the dependent variable, when the relationship between the two variables is linear. We also discussed how we can mathematically express what we call simple linear regression and multiple linear regression. So this is how simple linear regression can be represented. In the case of simple linear regression, you might recall that we are dealing with just a single independent variable.
And we always have just one dependent variable, both in simple linear regression and in multiple linear regression. So here you can see that yi is equal to beta 0 plus beta 1 times xi plus ui, where y is the dependent variable and i is basically the index of each observation, or row. The beta 0 is the intercept, which is also known as the constant. The beta 1 is the slope coefficient, or the parameter corresponding to the independent variable x, which is an unknown constant that we want to estimate along with the beta 0. The xi is the independent variable corresponding to observation i. And finally, the ui is the error term corresponding to observation i. Do keep in mind that we add this error term because we know that we are always going to make a mistake and we can never perfectly estimate the dependent variable; to account for this mistake, we add this ui. Let's also recall the estimation technique that we use to estimate the parameters of the linear regression model, so the beta 0 and the beta 1, and to predict the response variable. We call this estimation technique OLS, or ordinary least squares. OLS is an estimation technique for estimating the unknown parameters in the linear regression model in order to predict the response, or the dependent, variable. So we need to estimate the beta 0, to get the beta 0 hat, and we need to estimate the beta 1, or the beta 1 hat, in order to obtain the yi hat. So yi hat is equal to beta 0 hat plus beta 1 hat times xi, where the difference between the yi and the yi hat, so the true value of the dependent variable and the predicted value, produces our estimate of the error, or what we also call the residual. The main idea behind this approach is to find the best fitting straight line, so the regression line, through a set of paired x and y values, by minimizing the sum of squared residuals. We want to minimize our errors as much as possible; therefore, we take their squared versions, sum them up, and minimize this entire error. So, to minimize the sum of squared residuals, the differences between the observed dependent variable and its values predicted by the linear function of the independent variables, we need to use OLS. One of the most common questions related to linear regression that comes up time and time again in data science related interviews is the topic of the assumptions of the linear regression model. So you need to know each of these five fundamental assumptions of linear regression and OLS, and you also need to know how to test whether each of these assumptions is satisfied. The first assumption is the linearity assumption, which states that the relationship between the independent variables and the dependent variable is linear. We also say that the model is linear in parameters. You can check whether the linearity assumption is satisfied by plotting the residuals against the fitted values. If the pattern is non-linear, then the estimates will be biased. In this case, we say that the linearity assumption is violated, and we need to use more flexible models, such as the tree-based models that we will discuss in a bit, that are able to model these non-linear relationships.
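For the linearity check just described, here is a minimal sketch (assuming statsmodels and matplotlib; the data is synthetic and deliberately non-linear) of plotting the residuals against the fitted values. A clear curve in the residuals signals that the linearity assumption is violated.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2 + 0.5 * x**2 + rng.normal(scale=2.0, size=200)  # true relationship is quadratic

# Fit a (mis-specified) simple linear regression with OLS.
X = sm.add_constant(x)          # adds the intercept column (beta_0)
results = sm.OLS(y, X).fit()

# Residuals vs fitted values: a U-shaped pattern here indicates non-linearity.
plt.scatter(results.fittedvalues, results.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values (linearity check)")
plt.show()
```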
The second assumption in linear regression is the assumption about the randomness of the sample, which means that the data is randomly sampled, and which basically means that the errors, or the residuals, of the different observations in the data are independent of each other. You can check whether this second assumption, the assumption about a random sample, is satisfied by plotting the residuals: you can then check whether the mean of these residuals is around zero, and if not, the OLS estimates will be biased and the second assumption is violated. This means that you are systematically over- or under-predicting the dependent variable. The third assumption is the exogeneity assumption, which is a really important assumption, often asked about during data science interviews. Exogeneity means that each independent variable is uncorrelated with the error terms. Exogeneity refers to the assumption that the independent variables are not affected by the error term in the model; in other words, the independent variables are assumed to be determined independently of the errors in the model. Exogeneity is a key assumption of linear regression models, as it allows us to interpret the estimated coefficients as representing the true causal effects of the independent variables on the dependent variable. If the independent variables are not exogenous, then the estimated coefficients may be biased and the interpretation of the results may be invalid. In this case, we call this problem an endogeneity problem, and we say that the independent variable is not exogenous but endogenous. It's important to carefully consider the exogeneity assumption when building a linear regression model, as violation of this assumption can lead to invalid or misleading results. If this assumption is satisfied for an independent variable in the linear model, we call this independent variable exogenous; otherwise we call it endogenous, and we say that we have a problem of endogeneity. Endogeneity refers to the situation in which the independent variables in the linear regression model are correlated with the error terms in the model; in other words, the errors are not independent of the independent variables. Endogeneity is a violation of one of the key assumptions of the linear regression model, which is that the independent variables are exogenous, or not affected by the errors in the model. Endogeneity can arise in a number of ways. For example, it can be caused by omitted variable bias, in which an important predictor of the dependent variable is not included in the model. It can also be caused by reverse causality, in which the dependent variable affects the independent variable. These two are very popular examples of cases where we can get an endogeneity problem, and they are things that you should know whenever you are interviewing for data science roles, especially when it's related to machine learning, because those questions are asked in order to test whether you understand the concept of exogeneity versus endogeneity, in which cases you can get endogeneity, and also how you can solve it. So, in the case of omitted variable bias, let's say you are estimating a person's salary, and you are using as independent variables their education, their number of years of experience, and some other factors, but you are not including in your model a feature that would describe the intelligence of the person, for instance the IQ of the person.
Well, given that this is a very important indicator of a person's ability to perform in their field, and it can definitely have an indirect impact on their salary, not including this variable will result in omitted variable bias, because it will then be incorporated in your error term. And it can also relate to the other independent variables, because your IQ is also related to the education that you have: the higher the IQ, usually the higher the education. So in this way, you will have an error term that includes an important variable, the omitted variable, which is then correlated with one or more of the independent variables included in your model. The other example, the other cause of the endogeneity problem, is reverse causality. What reverse causality means is basically that not only does the independent variable have an impact on the dependent variable, but the dependent variable also has an impact on the independent variable. So there is a reverse relationship, which is something that we want to avoid. We want the features included in our model to only have an impact on the dependent variable, so they are explaining the dependent variable, but not the other way around. Because if you have the other direction, so the dependent variable impacting your independent variable, then you will have the error term being related to this independent variable, because there are components of it that also define your dependent variable. So knowing a few examples such as these that can cause endogeneity, and so violate the exogeneity assumption, is really important. You can also check the exogeneity assumption by conducting a formal statistical test. This is called the Hausman test. It's an econometric test that helps to understand whether you have an exogeneity violation or not, but it is out of the scope of this course. I will, however, include many resources related to exogeneity, endogeneity, omitted variable bias, as well as reverse causality, and also how the Hausman test can be conducted. For that, check out the interview preparation guide, where you can also find the corresponding free resources. The fourth assumption in linear regression is the assumption about homoscedasticity. Homoscedasticity refers to the assumption that the variance of the errors is constant across all predicted values. This assumption is also known as the homogeneity of variance. Homoscedasticity is an important assumption of the linear regression model, as it allows us to use certain statistical techniques and make inferences about the parameters of the model. If the errors are not homoscedastic, then the results of these techniques may be invalid or misleading. If this assumption is violated, we say that we have heteroscedasticity. Heteroscedasticity refers to the situation in which the variance of the error terms in a linear regression model is not constant across all the predicted values, so we have a varying variance. In other words, the assumption of homoscedasticity is violated in that case, and we say we have a problem of heteroscedasticity. Heteroscedasticity can be a real problem in linear regression analysis, because it can lead to invalid or misleading results. For example, the standard error estimates and the confidence intervals for the parameters may be incorrect, which means that the statistical tests may also have incorrect type one error rates.
You might recall that, when we were discussing linear regression as part of the fundamental statistics section of this course, we looked into the output that comes from Python, and we saw that we get estimates as part of the output, as well as standard errors, the t-tests, so the student t-tests, and then the corresponding p-values and the 95% confidence intervals. Whenever there is a heteroscedasticity problem, the coefficients might still be accurate, but the corresponding standard errors, the student t-test, which is based on the standard error, the p-values, as well as the confidence intervals, may not be accurate. So you might get good and reasonable coefficients, but then you don't know how to correctly evaluate them. You might end up stating that certain independent variables are statistically significant because their coefficients are statistically significant, since their p-values are small, but in reality those p-values are misleading, because they are based on the wrong statistical tests and on the wrong standard errors. You can check this assumption by plotting the residuals against the predicted values and looking at their spread. If the spread of the residuals is roughly constant, the homoscedasticity assumption holds; but if you see a funnel-like shape, where the spread grows or shrinks with the predicted values, then the variance is not constant, and we say we have a problem of heteroscedasticity. If you have heteroscedasticity, you can no longer use OLS and the linear regression as they are. Instead, you need to look for other, more advanced econometric regression techniques that do not make such a strong assumption regarding the variance of your residuals. You can, for instance, use GLS, FGLS, or GMM, and these types of solutions will help to solve the heteroscedasticity problem, as they do not make a strong assumption regarding the variance in your model. The fifth and final assumption in linear regression is the assumption of no perfect multicollinearity. This assumption states that there are no exact linear relationships between the independent variables. Multicollinearity refers to the case when two or more independent variables in your linear regression model are highly correlated with each other. This can be a problem because it can lead to unstable and unreliable estimates of the parameters in the model. Perfect multicollinearity happens when the independent variables are perfectly correlated with each other, meaning that one variable can be perfectly predicted from the other ones. This can cause the estimated coefficients in your linear regression model to be infinite or undefined, and can lead your errors to be entirely misleading when making predictions using this model. If perfect multicollinearity is detected, it may be necessary to remove one, if not more, problematic variables, so that you avoid having correlated variables in your model. And even if perfect multicollinearity is not present, multicollinearity at a high level can still be a problem if the correlations between the independent variables are high. In this case, the estimates of the parameters may be imprecise, and the model may be entirely misleading and result in less reliable predictions. To test for the multicollinearity assumption, you have different options. The first way you can do that is by using the Dickey-Fuller test.
The Dickey-Fuller test is a formal statistical and econometric test that will help you to identify which variables cause a problem and whether you have perfect multicollinearity in your linear regression model. You can also plot a heat map, which will be based on a correlation matrix corresponding to your features. Then you will have the correlations per pair of independent variables plotted as part of your heat map, and you can identify all the pairs of features that are highly correlated with each other; those are problematic features, one of which should be removed from your model. In this way, by showing the heat map, you can also show your stakeholders why you have removed certain variables from your model, whereas explaining the Dickey-Fuller test is much more complex, because it involves more advanced econometrics and linear regression explanations. So if you're wondering how you can perform this Dickey-Fuller test, and you want to prepare for questions related to perfect multicollinearity, as well as how you can solve the perfect multicollinearity problem in your linear regression model, then head towards the interview preparation guide included in this part of the course, where you can answer such questions and also see the 30 most popular interview questions you can expect from this section. Now let's look into an example from linear regression in order to see how all these pieces of the puzzle come together. Let's say we have collected data on the class size and the test scores for a sample of students, and we want to model the linear relationship between the class size and the test scores using a linear regression model. As we have just one independent variable, we are dealing with a simple linear regression, and the model equation would be as follows: the test score is equal to beta 0 plus beta 1 multiplied by the class size, plus epsilon. Here, the class size is the single independent variable that we have in our model. The test score is the dependent variable. The beta 0 is the intercept, or the constant. The beta 1 is the coefficient of interest, as this is the coefficient corresponding to our independent variable, and this will help us understand what the impact of a unit change in the class size is on the test score. And then, finally, we include in our model an error term to account for the mistakes that we are definitely going to make when estimating the dependent variable, the test scores. The goal is to estimate the coefficients beta 0 and beta 1 from the data and use the estimated model to predict the test scores based on the class size. Once we have the estimates, we can interpret them as follows. The y-intercept, the beta 0, represents the expected test score when the class size is zero; it represents the base score that a student would have obtained if the class size had been zero. The coefficient for the class size, the beta 1, represents the change in the test scores associated with a one-unit change in the class size. A positive coefficient would imply that a one-unit change in the class size would increase the test scores, whereas a negative coefficient would imply that a one-unit change in the class size would decrease the test scores correspondingly. We can then use this model with the OLS estimates in order to predict the test scores for any given class size. So let's go ahead and implement that in Python.
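The on-screen code is not reproduced in this transcript, so here is a minimal sketch of the kind of fit being described, assuming a small pandas DataFrame named `students_data` with `class_size` and `test_score` columns; the column names and numbers below are made up, so the estimates will differ from the intercept and slope discussed next.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: class size and the corresponding test score.
students_data = pd.DataFrame({
    "class_size": [15, 18, 20, 22, 25, 28, 30, 32, 35, 40],
    "test_score": [78, 76, 75, 73, 72, 70, 69, 68, 66, 63],
})

X = students_data[["class_size"]]   # single independent variable
y = students_data["test_score"]     # dependent variable

model = LinearRegression().fit(X, y)
print("Intercept (beta_0):", round(model.intercept_, 2))
print("Class size coefficient (beta_1):", round(model.coef_[0], 2))
# A negative beta_1 means that each additional student in the class is
# associated with a lower expected test score.
```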
If you're wondering how this can be done in more detail, then head towards the resources section, as well as the Python for Data Science part of the course, where you can learn more about how to work with pandas DataFrames, how to import the data, and how to fit a linear regression model. So the problem is as follows: we have collected the data on the class size, so we have this independent variable. As you can see here, we have the students_data with the class size feature, and we want to estimate the y, which is the test score. Here is a sample code that fits a linear regression model. We are keeping everything very simple here: we are not splitting our data into train and test sets, fitting the model on the training data, and making predictions on the test data; we just want to see how we can interpret the coefficients. You can see here that we are getting an intercept equal to 63.17, and the coefficient corresponding to our single independent variable, class size, is equal to minus 0.40. What this means is that each increase of the class size by one unit will result in a decrease of the test score by 0.4. So there is a negative relationship between the two. Now, the next question is whether there is statistical significance, whether the coefficient is actually significant, and whether the class size actually has a statistically significant impact on the dependent variable; but those are things that we have discussed as part of the fundamental statistics section of this course, and we are also going to look into a linear regression example when we discuss hypothesis testing. So I would highly suggest that you stop here to revisit the fundamental statistics section of this course, to refresh your memory in terms of linear regression, and then also check the hypothesis testing section of the course, in order to look into a specific example of linear regression where we discuss the standard errors, how you can evaluate your OLS estimation results, and how you can use and estimate the student t-test, the p-value, and the confidence intervals. In this way, you will learn for now only the theory related to the coefficients, and then you can add on top of this theory once you have learned all the other sections and the other topics in this course. Let's finally discuss the advantages and the disadvantages of the linear regression model. Some of the advantages of the linear regression model are the following. Linear regression is relatively simple and easy to understand and implement. Linear regression models are well suited for understanding the relationship between a single independent variable and a dependent variable. Linear regression can also handle multiple independent variables and can estimate the unique relationship between each independent variable and the dependent variable. The linear regression model can also be extended to handle more complex models, such as polynomials and interaction terms, allowing for more flexibility in modeling the data. Also, the linear regression model can be easily regularized to prevent overfitting, which is a common problem in modeling, as we saw in the beginning of this section. You can, for instance, use ridge regression, which is an extension of linear regression, or Lasso regression, which is also an extension of the linear regression model.
And then, finally, linear regression models are widely supported by software packages and libraries, making them easy to implement and analyze. Some of the disadvantages of linear regression are the following. Linear regression models make strong assumptions, for instance about the linearity between the independent variables and the dependent variable, while the true relationship can actually be non-linear. The model will then not be able to capture the complexity of the data, so the non-linearity, and the predictions will be inaccurate. Therefore, it's really important to have data with a linear relationship for linear regression to work. Linear regression also assumes that the error terms are normally distributed, homoscedastic, and independent across observations. Violations of these strong assumptions will lead to biased and inefficient estimates. Linear regression is also sensitive to outliers, which can have a disproportionate effect on the estimates of the regression coefficients. Linear regression does not easily handle categorical independent variables, which often require additional data preparation, or the use of indicator variables or encodings. Finally, linear regression also assumes that the independent variables are exogenous and not affected by the error terms. If this assumption is violated, then the results of the model may be misleading. In this lecture, lecture number 5, we will discuss another simple machine learning technique called logistic regression, which is a simple but very important classification model, useful when dealing with a problem where the output should be a probability. The name regression in logistic regression might be confusing, since this is actually a classification model. Logistic regression is widely used in a variety of fields, such as the social sciences, medicine, and engineering. So let us first define the logistic regression model. Logistic regression is a supervised classification technique that models the conditional probability of an event occurring, or an observation belonging to a certain class, given a dataset of independent variables, and those are our features. The class can have two categories or more, but later on we will learn that logistic regression works ideally when we have just two classes. This is another very important and very popular machine learning technique which, though named logistic regression, is actually a supervised classification technique. So when the relationship between the two variables is linear, the dependent variable is a categorical variable, and you want to predict the variable in the form of a probability, so a number between 0 and 1, then logistic regression comes in very handy. This is because, during the prediction process in logistic regression, the classifier predicts the probability, a value between 0 and 1, of each observation belonging to a certain class. For instance, if you want to predict the probability, or the likelihood, of a candidate being elected or not elected during an election, given the set of characteristics that you have about your candidate, let's say the popularity score, the past successes, and other descriptive variables about this candidate, then logistic regression comes in very handy to model this probability. So rather than predicting the response variable, logistic regression models the probability that y belongs to a particular category.
Similar to linear regression, with the difference that instead of y, it predicts the log odds. We will come to this definition of log odds and odds in a bit. In statistical terminology, what we are trying to do is model the conditional distribution of the response y, given the predictors x. Therefore, logistic regression helps to predict the probability of y belonging to a certain class, given the feature space, what we call the probability of y given x. If you are wondering what the concept of probability is, or what this conditional probability is, then make sure to head towards the fundamental statistics section, as we go into detail about these concepts there and look into different examples. These definitions and concepts will help you to better follow this lecture. So here we see the probability p(x), which is what we are interested in modeling, and it's equal to e to the power beta 0 plus beta 1 times x, divided by 1 plus e to the power beta 0 plus beta 1 times x. Let's now look into the formulas for the odds and the log odds. Both of these formulas are really important, because you can expect them during your data science interviews. Sometimes you will be asked to explicitly write down the odds and log odds formulas. They are highly related to the log-likelihood and likelihood functions, which are the basis for the estimation technique MLE, or maximum likelihood estimation, used to estimate the unknown parameters in logistic regression. So the log odds and the odds are highly related to each other, and in logistic regression we use the odds and log odds to describe the probability of an event occurring. The odds is the ratio of the probability of an event occurring to the probability of the event not occurring. As you can see, the odds is equal to p(x) divided by 1 minus p(x), where p(x) is the probability of the event occurring, and 1 minus p(x) is the probability of the event not occurring. This is equal to e to the power beta 0 plus beta 1 times x in our formula, where we only have one independent variable. And e is simply Euler's number, approximately 2.72, which is a constant. We won't derive this formula ourselves, because that's out of the scope of this course, but feel free to go back to the p(x) formula that we just saw in the previous slide, divide it by 1 minus that same expression, and you can verify that you will end up with the expression that you see here. For example, if the probability of a person having a heart attack is 0.2, then the odds of having a heart attack will be 0.2 divided by 1 minus 0.2, which is equal to 0.25. The log odds, also known as the logit function, is the natural logarithm of the odds. So, as you can see here, the log of p(x) divided by 1 minus p(x) is equal to beta 0 plus beta 1 times x. You can see that we are getting rid of the e, and this is simply because of a mathematical property which says that if we take the log of e to the power of something, we end up with only the exponent. Though it's out of the scope of this course to look into the mathematical derivation of these formulas, I will include many resources regarding the logarithm, the transformations, and the mathematics behind them, just in case you want to look into those details and do some extra learning. So logistic regression uses the log odds as the dependent variable, and the independent variables are used to predict this log odds.
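As a tiny numeric illustration of the two formulas just described (a sketch with a made-up probability), the odds and log odds can be computed directly:

```python
import numpy as np

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def log_odds(p):
    """Log odds (logit): the natural logarithm of the odds."""
    return np.log(odds(p))

p = 0.2  # e.g. probability of a person having a heart attack
print("odds:", odds(p))          # 0.2 / 0.8 = 0.25
print("log odds:", log_odds(p))  # log(0.25) is approximately -1.386
```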
The coefficients of the independent variables then represent the change in the log odds for a one-unit change in the independent variable. You might recall that in linear regression we were modeling the actual dependent variable; in the case of logistic regression, the difference is that we are modeling the log odds. Another important concept in logistic regression is the likelihood function. The likelihood function is used to estimate the parameters of the model given the observed data. Sometimes during interviews you might also be asked to write down the exact likelihood or log-likelihood function, so I would definitely suggest you memorize this one and understand all the components included in the formula. The likelihood function describes the probability of the observed data given the parameters of the model. And if you followed the lecture on probability density functions in the section Fundamentals of Statistics, you might even recognize the Bernoulli distribution here, since the likelihood function is based on the probability mass function of a Bernoulli distribution, which is the distribution of a binary outcome. So this is highly applicable to the case where we have only two categories in our dependent variable, and we are trying to estimate the probability of an observation belonging to one of those two classes. So here we have the likelihood function and the log-likelihood function. We start with the likelihood function, where the capital letter L stands for the likelihood. L is equal to the product, across all observations, of p(x_i) to the power y_i, multiplied by 1 minus p(x_i) to the power 1 minus y_i, where p(x_i) is the p(x) that we just saw, evaluated for observation i, and y_i is simply the class of observation i. So y_i will either be equal to zero or one; if y_i is equal to one, then 1 minus y_i is equal to zero. So for every observation we look at the probability of it belonging to the first class, multiplied by the probability of it not belonging to that class, and we take this product across all the observations included in our data. The product symbol comes from mathematics and simply stands for multiplying all these terms together. Given that it's harder to work with products compared to sums, we then apply the log transformation in order to obtain the log-likelihood function instead of the likelihood function. So when we take the logarithm of this expression, we end up with the log-likelihood expression. And here, one more time, we are making use of a mathematical property which says that the logarithm of a product is the sum of the logarithms, so we go from products to sums. I will also include resources regarding this, so that you can learn the mathematics behind these transformations. So the log-likelihood, with a lowercase l, is equal to the logarithm of the product of p(x_i) to the power y_i multiplied by 1 minus p(x_i) to the power 1 minus y_i. And when we apply that mathematical transformation, l is equal to the sum across all observations, i equal to 1 up to m, of y_i (the exponent comes to the front) multiplied by the logarithm of p(x_i), plus 1 minus y_i multiplied by the logarithm of 1 minus p(x_i).
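As a rough sketch (not the exact code used in the course), the log-likelihood from this lecture can be written directly with NumPy. Here `y` is the vector of 0/1 labels and `p` is the vector of predicted probabilities p(x_i); the example values are made up purely for illustration.

```python
import numpy as np

def log_likelihood(y, p, eps=1e-12):
    """l = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]"""
    p = np.clip(p, eps, 1 - eps)   # avoid taking log of exactly 0 or 1
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny made-up example: 4 observations with labels and predicted probabilities
y = np.array([1, 0, 1, 1])
p = np.array([0.8, 0.3, 0.6, 0.9])
print(log_likelihood(y, p))   # this is the quantity MLE maximizes over beta0, beta1
```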
While for linear regression we use OLS as the estimation technique, for logistic regression another estimation technique is needed. The reason we cannot use OLS in logistic regression to find the best-fitting line is that, with OLS, the predictions and the errors can become very large, very small, and sometimes even negative, while for logistic regression we aim for a predicted value between zero and one. Therefore, for logistic regression, we need to use the estimation technique called maximum likelihood estimation, or in short MLE, where the likelihood function calculates the probability of observing the data outcome given the input data and the model. We just saw the likelihood function in the previous slide. This function is then optimized to find the set of parameters that results in the largest likelihood, so the maximum likelihood, over the training dataset. The logistic function will always produce this S-shaped curve, regardless of the value of the independent variable X, resulting in a sensible estimate, so a value between zero and one. So as you can see, this S-shaped curve is what characterizes the logistic function used in logistic regression, and it will always provide an outcome between zero and one. The idea behind maximum likelihood estimation is then to find the set of estimates that maximizes the likelihood function. So let's go through maximum likelihood estimation step by step. What we need to do first is to define the likelihood function; the first step is always to define this function for the model. Secondly, we need to write the log-likelihood function: the next step is to take the natural logarithm of the likelihood function to obtain the log-likelihood function, so the one I was just talking about. The log-likelihood function is a more convenient and computationally efficient function to work with. What we need to do next is to find the maximum of this log-likelihood function. This step consists of finding the values of the parameters, beta 0 and beta 1, that maximize the log-likelihood function. There are many optimization algorithms that can be used to find the maximum, but they are out of the scope of this course, and you don't need to know them as part of becoming a data scientist and entering the data science field. In the fourth step, we need to estimate the parameters, so we are talking about beta 0 and beta 1. Once the maximum of the log-likelihood function is found, the values of the parameters that correspond to the maximum are considered the maximum likelihood estimates of the parameters. Then, in the next step, we need to check the model fit. Once the maximum likelihood estimates are obtained, we can check the goodness of fit of the model by calculating information criteria such as AIC, BIC or R-squared, where AIC stands for Akaike information criterion, BIC stands for Bayesian information criterion, and R-squared refers to the same evaluation value that we use for evaluating linear regression. In the final step, we need to make predictions and evaluate the model. Using the maximum likelihood estimates, the model can be used to make predictions on new, unseen data, and the performance of the model can then be evaluated using various evaluation metrics such as accuracy, precision, and recall. Those are metrics that we revisited as part of the very first lecture in this section, and those are metrics that you need to know.
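In practice you rarely implement MLE by hand; libraries do the optimization for you. As an illustrative sketch on synthetic, made-up data (not the course's dataset), statsmodels' `Logit` fits a logistic regression by maximum likelihood and reports the log-likelihood and the information criteria discussed above:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic binary-classification data, made up for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))   # an assumed "true" logistic relationship
y = rng.binomial(1, p_true)

X = sm.add_constant(x)           # adds the intercept column (beta 0)
model = sm.Logit(y, X).fit()     # parameters estimated by maximum likelihood

print(model.summary())           # coefficients, log-likelihood, pseudo R-squared
print(model.aic, model.bic)      # information criteria for checking model fit
```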
So unlike the AIC and BIC that we just spoke about, which evaluate the goodness of fit of the initial estimates that come from maximum likelihood, the accuracy, the precision and the recall evaluate the final model, so the values that we get for new, unseen data when we make predictions and obtain the classes. And those are metrics that you need to know. If you are wondering what accuracy, precision and recall are, as well as the F1 score, make sure to head towards the very first lecture in this section, where we talked about the exact definitions of these metrics. Let's finally discuss the advantages and the disadvantages of logistic regression. Some of the advantages of logistic regression are that it's a simple model, it has low variance, it has low bias, and it provides probabilities. Some of the disadvantages of logistic regression are the following. Logistic regression is unable to model non-linear relationships: one of the key assumptions logistic regression makes is that there is a linear relationship between your independent variables and the log odds of your dependent variable. Logistic regression is also unstable when your classes are well separable, and it becomes very unstable when you have more than two classes. So this means that whenever you have more than two categories in your dependent variable, or whenever your classes are well separable, using logistic regression for classification purposes will not be very smart. Instead, you should look for other models that you can use for this task, and one such model is linear discriminant analysis, the LDA, which we will introduce in the next lecture. So this is all for this lecture, where we have looked into logistic regression and maximum likelihood estimation. In the next lecture, we will look into the LDA. So stay tuned, and I will see you in the next lecture. Looking to step into machine learning or data science? It's about starting somewhere practical yet powerful. And that's the simple yet most popular machine learning algorithm, linear regression. Linear regression isn't just jargon, it's a tool used both for finding out which features in your data matter most and for forecasting the future. That's your starting point in the journey of data science and hands-on machine learning work. Embark on a hands-on data science and machine learning project where we are going to find out what the drivers of Californian house prices are. You will clean the data, visualize the key trends, and learn how to process your data and how to use different Python libraries to understand the drivers of Californian house values. You are going to learn how to implement linear regression in Python and learn all the fundamental steps that you need in order to conduct a proper hands-on data science project. At the end of this project, you will not only learn the different Python libraries for data science and machine learning, such as Pandas, Scikit-learn, Statsmodels, Matplotlib and Seaborn, but you will also be able to put this project on your personal website and on your resume. A step-by-step case study and approach to build your confidence and expertise in machine learning and in data science. In this part, we are going to talk about a case study in the field of predictive analytics and causal analysis.
So we are going to use this simple yet powerful regression technique called linear regression in order to perform causal analysis and predictive analytics. By causal analysis, I mean that we are going to look into correlation and causation and try to figure out which features have an impact on the housing price, on the house value. So which features describing the house define and cause the variation in house prices? The goal of this case study is to practice the linear regression model and to get a first feeling of how you can use a simple machine learning model to perform model training and model evaluation, and also use it for causal analysis, where you are trying to identify the features that have a statistically significant impact on your response variable, your dependent variable. So here is the step-by-step process that we are going to follow in order to find out which features define Californian house values. First, we are going to understand the set of independent variables that we have, and also the response variable for our multiple linear regression model. We are going to understand which techniques we need and which libraries in Python we need to load in order to conduct this case study: first we load all these libraries and understand why we need them. Then we are going to conduct data loading and data preprocessing. This is a very important step, and I deliberately didn't want you to skip it and didn't want to give you clean data, because in a normal, real, hands-on data science job you won't get clean data. You will get dirty data, which will contain missing values and outliers, and those are things that you need to handle before you proceed to the actual part, which is the modeling and the analysis. Therefore, we are going to do missing data analysis: we are going to remove the missing data from our Californian house price data. We are going to conduct outlier detection: we are going to identify outliers, learn different visualization techniques in Python that you can use to identify outliers, and then remove them from the data. Then we are going to perform data visualization: we will explore the data and make different plots to learn more about the data, about these outliers, and about different statistical techniques combined with Python. Then we are going to do correlation analysis to identify potentially problematic features, which is something I would suggest you do independent of the nature of your case study, to understand what kind of variables you have, what the relationships between them are, and whether you are dealing with some potentially problematic variables. Then we will move towards the fun part, which is performing the multiple linear regression in order to conduct the causal analysis, meaning identifying the features of the Californian house blocks that define the value of the Californian houses.
Finally, we'll very quickly do another implementation of the same multiple linear regression, to give you not only one but two different ways of conducting linear regression, because linear regression can be used not only for causal analysis but also as a standalone, common machine learning regression model. Therefore, I will also show you how to use scikit-learn as a second way of training and then predicting the Californian house values. So without further ado, let's get started. Once you become a data scientist, machine learning researcher or machine learning engineer, there will be some hands-on data science projects where the business will come to you and say: here we have this data, and we want to understand which features have the biggest influence on a certain target variable. In this specific case, in our case study, let's assume we have a client who is interested in identifying the features that define the house price. Maybe it's someone who wants to invest in houses, someone interested in buying houses, perhaps renovating them and then reselling them and making a profit that way. Or maybe it's the long-term investment market, where people buy real estate as an investment, hold it for a long time and sell it later, or for some other purpose. The end goal in this specific case is to identify the features of the house that make it priced at a certain level. So what are the features of the house that are causing the price and the value of the house? We are going to make use of a very popular dataset that is available on Kaggle. It originally comes from Scikit-learn and is called California Housing Prices. I'll also make sure to put the link to this specific dataset in my GitHub account, under the repository dedicated to this case study, and I will also point out additional links that you can use to learn more about this dataset. This dataset is derived from the 1990 US Census, so the United States Census, using one row per census block group. A block group, or block for short, is the smallest geographical unit for which the US Census Bureau publishes sample data; a block group typically has a population of 600 to 3,000 people. A household is a group of people residing within a single home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts. So let's now look into the variables that are available in this dataset. What we have here is MedInc, which is the median income in the block group; this touches the financial side, the financial level of the block of households. Then we have HouseAge, which is the median house age in the block group. Then we have AveRooms, the average number of rooms per household, and AveBedrms, the average number of bedrooms per household. Then we have Population, which is the block group population, so that's basically, like we just saw, the number of people who live in that block. Then we have AveOccup, which is the average number of household members. And then we have Latitude and Longitude, which are the latitude and longitude of the block group we are looking at.
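If you want to inspect this dataset and its official description without downloading anything from Kaggle, a quick sketch using scikit-learn's built-in loader (which is where the dataset originally comes from) looks like this; note that the Kaggle CSV used later in the case study has slightly different column names.

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)

print(housing.feature_names)   # MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
print(housing.frame.head())    # features plus the target (median house value)
print(housing.DESCR[:500])     # the official description of the 1990 census block-group data
```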
So, as you can see here, we are dealing with aggregated data. We don't have the data per household; rather, the data is calculated, averaged and aggregated per block. This is very common in data science, when we want to reduce the dimension of the data, have sensible numbers, and create cross-sectional data. Cross-sectional data means that we have multiple observations for which we have data for a single time period; in this case, we are using the block as the aggregation unit. And we have already learned, as part of the theory lectures, this idea of the median. We have seen that there are different descriptive measures that we can use to aggregate our data: one of them is the mean, and another is the median. Oftentimes, especially if we are dealing with a skewed distribution, so a distribution that is not symmetric but rather right-skewed or left-skewed, we need to use the median, because the median is then a better representation of the scale of the data compared to the mean. In this case, we will soon see, when visualizing this data, that we are indeed dealing with skewed data. So this is basically a very simple, very basic dataset without too many features, which makes it a great way to get hands-on with an actual machine learning use case. We will keep it simple, yet we will learn the basics and the fundamentals in a very good way, such that learning more difficult and more advanced machine learning models will be much easier for you. So let's now get into the actual coding part. Here I will be using Google Colab, and I will be sharing the link to this notebook, combined with the data, in my Python for Data Science repository, so you can make use of it to follow this tutorial with me. We always start with importing libraries. We could run a linear regression manually, without using libraries, by using matrix multiplication, but I would suggest you not do that; you can do it for fun, or to understand the matrix multiplication and the linear algebra behind linear regression. But if you want to get hands-on and understand how to use linear regression the way you would be expected to on your day-to-day job, then you would use libraries such as scikit-learn, or the statsmodels.api library. In order to understand this topic, and also to get hands-on, I decided to showcase this example not only in one library, scikit-learn, but also in statsmodels. The reason is that many people use linear regression just for predictive analytics, and for that, scikit-learn is the go-to option. But if you want to use linear regression for causal analysis, so to identify and interpret the features, the independent variables that have a statistically significant impact on your response variable, then you will need to use another library, a very handy one for linear regression, called statsmodels.api. From there, you import it as sm, and this will help you to do exactly that. Later on, we will see how nicely this library provides the output, exactly like you would learn in a traditional econometrics or introduction to linear regression class. So I'm going to give you all this background information, like no one before.
And we're going to interpret and learn everything, such that you start your machine learning journey in a very proper and high-quality way. So in this case, the first thing we are going to import is the pandas library: we import pandas as pd, and then the NumPy library as np. We need pandas to create a pandas DataFrame, to read the data, and then to perform data wrangling: to identify the missing data, the outliers, so the common data wrangling and data preprocessing steps. Then we are going to use NumPy, which is commonly used whenever you are visualizing data or dealing with matrices or arrays; pandas and NumPy are used together all the time. Then we are going to use matplotlib, and specifically pyplot from it. This library is very important when you want to visualize data. Then we have Seaborn, which is another handy data visualization library in Python. So whenever you want to visualize data in Python, matplotlib and Seaborn are two very handy visualization libraries that you must know. If you like the cooler undertone of colors, Seaborn will be your go-to option, because the visualizations you create are much more appealing compared to matplotlib, but the underlying way of working, so plotting scatterplots, lines or heatmaps, is the same. Then we have statsmodels.api, the library from which we import sm: that is the key tool, the linear regression model, that we will be using for our causal analysis. Here I'm also importing, from scikit-learn's linear_model, the LinearRegression model, and this one is basically similar to the statsmodels one; you can use both of them. But it is the common way of working with a machine learning model: whenever you are doing predictive analytics, you are not using the data to identify features that have a statistically significant impact on the response variable, so features that have an influence on and are causing the dependent variable, but rather you are just interested in using the data to train the model and then test it on unseen data, and for that you can use Scikit-learn. Scikit-learn is something you will use not only for linear regression but also for other machine learning models: think of KNN, logistic regression, random forest, decision trees, boosting techniques such as LightGBM and GBM, and also clustering techniques like k-means and DBSCAN. Anything you can think of that fits into this category of traditional machine learning models, you'll be able to find in Scikit-learn. Therefore, I didn't want to limit this tutorial only to statsmodels, which we could do if we wanted this case study to be specifically about linear regression, which we are doing. Instead, I also wanted to showcase the usage of Scikit-learn, because Scikit-learn is something that you can use beyond linear regression, for all those other types of machine learning models. And given that this course is designed to introduce you to the world of machine learning, I thought we would combine this with Scikit-learn, something you are going to see time and time again when you are using Python for machine learning. Then I'm also importing train_test_split from sklearn.model_selection, such that we can split our data into train and test sets.
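Putting the imports just described together, the setup cell looks roughly like this (a sketch of the described setup, not a copy of the course notebook):

```python
import pandas as pd                    # DataFrames and data wrangling
import numpy as np                     # arrays and numeric helpers
import matplotlib.pyplot as plt        # base plotting
import seaborn as sns                  # nicer statistical plots
import statsmodels.api as sm           # OLS with a full statistical summary (causal analysis)
from sklearn.linear_model import LinearRegression    # purely predictive linear regression
from sklearn.model_selection import train_test_split # splitting into train and test sets
```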
Now, before we move on to the actual training and testing, we first need to load our data. So what I did was to put this housing.csv data into the sample data folder here in Google Colab. That's the data you can download when you go to the Kaggle page: when you go there, you can download the housing data file, and that's exactly what I downloaded and then uploaded here into Colab. So this housing.csv sits in that folder; I copy its path and create a variable that holds it, so file_path is the string variable that holds the path of the data. Then I take this file_path and pass it to pd.read_csv, which is the function we can use to load the data. pd stands for pandas, the short way of naming pandas, and read_csv is the function that we are taking from the pandas library; within the parentheses we put the file_path. If you want to learn more about the basics of variables, the different data structures, some basic Python for data science, then, to keep this tutorial structured, I will not be talking about that here. Feel free to check the Python for Data Science course, and I will put the link in the comments below, so you can learn that if you don't know it yet, and then come back to this tutorial to learn how you can use Python in combination with linear regression. The first thing I tend to do before moving on to the actual execution stage is to look into the data, to perform data exploration. What I tend to do is look at the data fields, so the names of the variables available in the data, and you can do that with data.columns. This will show the columns in your data, which are the names of your data fields. So let's go ahead and do a command-enter. We see that we have longitude, latitude, housing_median_age, total_rooms, total_bedrooms and population, so basically the number of people who live in those households and houses. Then we have households, median_income, median_house_value, and ocean_proximity. Now, you might notice that the names of these variables are a bit different than in the official documentation of the California housing data. You can see that the naming is different, but the underlying explanation is the same; here they are just trying to represent it with nicer naming. It is a common thing to see in Python, when we are dealing with data, that we have these underscores in the variable names. So we have housing_median_age, which in the documentation is called HouseAge: a bit different, but the meaning is the same, this is still the median house age in the block group. One thing you can also notice is that in the official documentation we don't have one extra variable that we have here, which is ocean_proximity. This basically describes the closeness of the house to the ocean, which of course for some people can definitely mean an increase or a decrease in the house price. So basically we have all these variables, and the next thing I tend to do is look at the actual data, and one thing we can do is just look at the top 10 rows of the data instead of printing the entire DataFrame.
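A sketch of this loading and first-look step; the path is just an assumed example location for the Kaggle housing.csv, so adjust it to wherever you uploaded the file:

```python
file_path = "sample_data/housing.csv"   # assumed path; point this at your own copy of the Kaggle file
data = pd.read_csv(file_path)

print(data.columns)    # longitude, latitude, housing_median_age, total_rooms, ...
print(data.head(10))   # top 10 rows of the DataFrame
```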
So when we execute this specific part of the code, you can see that we get the top 10 rows of our data. We have the longitude, the latitude, and the housing median age: you can see values like 41 years, 21 years, 52 years, so the median age of the houses is 41, 21, 52, and this is per block. Then we have the total number of rooms: we see that in one block the total number of rooms across these houses is 7,099. So we are already seeing data that contains large numbers, which is something to take into account when dealing with machine learning models, and especially with linear regression. Then we have total bedrooms, population, households, median income, median house value, and ocean proximity. One thing you can see right off the bat is that longitude and latitude have some unique characteristics: longitude is negative, latitude is positive. But that's fine for linear regression, because what it basically looks at is whether a variation in a certain independent variable, in this case longitude or latitude, causes a change in the dependent variable. Just to refresh our memory on what linear regression will do in this case: we are dealing with multiple linear regression, because we have more than one independent variable. As independent variables we have the different features that describe the house, except the house price, because median house value is the dependent variable. That's what we are trying to figure out: we want to see which features of the house cause, so define, the house price. We want to identify the features that cause a change in our dependent variable, and specifically, what the change in our median house value is if we apply a one-unit change to an independent feature. In a multiple linear regression, as we learned during the theory lectures, the idea during causal analysis is to keep all the other independent variables constant and then investigate, for a specific independent variable, what change a one-unit increase in that variable produces in our dependent variable. So if we, for instance, change our housing median age by one unit, what will be the corresponding change in our median house value, keeping everything else constant? That's the idea behind multiple linear regression and behind using it for this specific use case. Here, we also want to find out the data types and learn a bit more about our data before proceeding to the next step, and for that I tend to use the info function in pandas. Given that the data is a pandas DataFrame, I just call data.info(), and this shows the data type and the number of non-null values per variable. As we already noticed from the head, and as we can see confirmed here, ocean proximity is a variable that is not numeric: you can see NEAR BAY as a value for that variable, which, unlike all the other values, is represented by a string. This is something that we need to take into account, because later on, when we do the data preprocessing and actually run the model, we will need to do something with this specific variable.
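A quick sketch of those two checks: the dtype overview just described, plus the unique-values check for ocean_proximity that is discussed next.

```python
data.info()   # dtypes and non-null counts; ocean_proximity shows up as object (string), the rest as float64

# Unique categories of the string variable, discussed in the next step
print(data["ocean_proximity"].unique())
```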
We will need to process it. For the rest, we are dealing with numeric variables: you can see that longitude, latitude and all the other variables, including our dependent variable, are numeric, so float64. The only variable that needs to be taken care of is this ocean_proximity, which, as we will also see, is a categorical string variable. What this basically means is that it has different categories. So, for instance, let us quickly check what all the unique values for this variable are. If we take the name of this variable, copied from the overview here, and call unique, this gives us the unique values for this categorical variable. So here we go: we actually have five different unique values for this categorical string variable. This means that ocean proximity can take five different values: it can be near bay, it can be less than one hour from the ocean, it can be inland, it can be near the ocean, or it can be on an island. So we are dealing with a feature that describes the distance of the block from the ocean. And the underlying idea here is that maybe this specific feature has a statistically significant impact on the house value, meaning that it might be possible that for some people, in certain areas or certain countries, living near the ocean increases the value of the house. If there is a huge demand for houses near the ocean, so people prefer to live near the ocean, then most likely there will be a positive relationship. If there is a negative relationship, it means that people in that area, in California for instance, do not prefer to live near the ocean: we would then see that if the house is in an area that is not close to the ocean, so further from the ocean, the house value is higher. This is something that we want to figure out with this linear regression. We want to understand which features define the value of the house, so that we can say: if the house has certain characteristics, then most likely the house price will be higher, or the house price will be lower. And linear regression helps us not only to understand which features those are, but also to understand how much higher or lower the value of the house will be if it has a certain characteristic, or if we increase a certain characteristic by one unit. Next, we are going to look into the missing data. In order to have a proper machine learning model, we need to do some data preprocessing. For that, we need to check for missing values in our data, understand the number of null values per data field, and then either remove some of those missing values or do imputation; depending on the amount of missing data we have, we can decide which of those solutions to take. Here we can see that we don't have any null values when it comes to longitude, latitude, housing median age, and all the other variables, except one independent variable, and that's total bedrooms.
So we can see that, out of all the observations we have, the total_bedrooms variable has 207 cases where we do not have the corresponding information. When representing these numbers as percentages, which is something you should do as your next step, we see that out of the entire dataset only about 1% of the total_bedrooms values are missing. This is really important, because simply looking at the number of missing observations per data field won't be helpful: you will not be able to understand how much of the data is missing in relative terms. Now, if for a certain variable you have 50% or 80% missing, it means that for the majority of your house blocks you don't have that information. Including that variable will not be beneficial for your model, nor will it be accurate to include it, and it will result in a biased model: if you have no information for the majority of observations and you do have it for certain observations, then you will automatically skew your results and end up with biased results. Therefore, if that specific variable is missing for the majority of your dataset, I would suggest you simply drop that independent variable. In this case, only about 1% of the house blocks are missing that information, which gives me confidence that I would rather keep this independent variable and just drop the observations that do not have total_bedrooms information. Another solution, instead of dropping those observations or the entire independent variable, is to use some sort of imputation technique. What this means is that we try to find a way to systematically find a replacement for the missing value: we can use mean imputation, median imputation, or more model-based, more advanced statistical or econometric approaches to perform imputation. For now, this is out of the scope of this problem, but I would say: look at the percentage of observations for which an independent variable has missing values. If this is low, like less than 10%, and you have a large dataset, then you should be comfortable dropping those observations. But if you have a small dataset, say you only have 100 observations and 20% or 40% of them are missing, then consider imputation: try to find values that can be used to replace the missing ones. Once we have this information and we have identified the missing values, the next step is to clean the data. Here, what I'm doing is taking the data we have and using the dropna function, which drops the observations where the value is missing. So I'm dropping all the observations for which total_bedrooms has a null value, getting rid of my missing observations. After doing that, I check whether I indeed got rid of them: you can see that when I print data.isnull().sum(), so when I sum up the number of null values per variable, I no longer have any missing observations. So I successfully deleted all the missing observations. The next stage is to describe the data through some descriptive statistics and through data visualization. Before moving on to the causal analysis or predictive analysis, in any sort of traditional machine learning approach, try to first look into the data.
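A sketch of this missing-data step and of the descriptive-statistics call discussed next, following the approach described above:

```python
# Count and percentage of missing values per column
missing_counts = data.isnull().sum()
missing_pct = 100 * missing_counts / len(data)
print(missing_counts)          # total_bedrooms should show 207 missing values
print(missing_pct.round(2))    # roughly 1% for total_bedrooms

# Since only ~1% of rows are affected, drop them rather than imputing
data = data.dropna(subset=["total_bedrooms"])
print(data.isnull().sum())     # all zeros now

# Descriptive statistics: count, mean, std, min, 25%, 50%, 75%, max per numeric column
print(data.describe())
```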
Try to understand the data and see whether you notice some patterns. What is the mean of the different numeric data fields? Do you have certain categorical values that cause imbalanced data? Those are things you can discover early on, before moving on to model training and testing and blindly believing the numbers. Data visualization and data exploration are a great way to understand the data you have before using it to train and test a machine learning model. So here I'm using the describe function of pandas, so data.describe(), and this gives me the descriptive statistics of my data. What we can see is that, in total, we have 20,640 observations, and we also have the mean of all the variables. You can see that every variable has the same count, which basically means that for all variables I have the same number of rows. Then, per variable, we have the mean, the standard deviation, so the square root of the variance, the minimum and the maximum, but also the 25th percentile, the 50th percentile, and the 75th percentile. Percentiles and quartiles are statistical terms that we often use: the 25th percentile is the first quartile, the 50th percentile is the second quartile, or the median, and the 75th percentile is the third quartile. What this basically means is that these percentiles help us to understand the threshold between the observations that fall below the 25 percent mark and those above it. When we look at the standard deviation, it helps us to interpret the variation in the data at the unit scale of that variable. In this case, take the variable median house value: the mean is approximately 206,000, so more or less in that range, 206K, and the standard deviation is 115K. What this means is that in the dataset we will find blocks whose median house value is around 206K plus 115K, which is around 321K, so there will be blocks where the median house value is around 321K, and there will also be blocks where the median house value is around 91K, so 206K minus 115K. That's the idea behind the standard deviation, the variation in your data. Next, we can interpret the minimum and the maximum of the data fields. The minimum helps you understand the smallest value you have per numeric data field, and the maximum the largest value, so the range of values you are looking at. In the case of the median house value, this means: what is the minimum median house value per block, and what is the highest median house value per block? This can help you to understand, when we look at this aggregated data, which blocks have the cheapest houses in terms of valuation, and which are the most expensive blocks of houses. We can see that in the cheapest block the median house value is 15K, so 14,999, and the block with the highest valuation, so the highest median house value, sits at $500,001, which means that when
we look at our blocks of houses, the median house value in the most expensive blocks will be at most around 500K. The next thing that I tend to do is to visualize the data. I tend to start with the dependent variable, so the variable of interest, the target or response variable, which in our case is the median house value; this will serve as our dependent variable. What I want to do is plot a histogram in order to understand the distribution of median house values. I want to see, when looking at the data, which median house values appear most frequently, and which blocks have unusual, less frequently appearing median house values. By plotting this type of plot, you can see some outliers, some frequently appearing values, but also some values that lie outside of the usual range, and this will help you to learn more about your data and to identify outliers in it. Here I'm using the Seaborn library; given that I already imported the libraries earlier, there is no need to import them here. What I'm doing is setting the grid style, which basically means I'm saying the background should be white and I also want this grid behind the plot. Then I'm initializing the size of the figure: plt, which comes from matplotlib's pyplot, and then I'm setting the figure size to 10 by 6, so this is the 10 and this is the 6. Then we have the main plot: I'm using the histplot function from Seaborn, and from the cleaned data, from which we removed the missing values, I'm picking the variable of interest, the median house value, and I'm saying: plot this histogram using the forest green color. Then I set the title of this figure, Distribution of Median House Values. I also set the x label, which is the name of the variable that goes on the x-axis, the median house value, and the y label, the name that goes on the y-axis. And then I call plt.show, which means: show me the figure. That's basically how visualization works in Python: we first set the figure size, then we call the plotting function with the right variable, so we provide data to the visualization, then we set the title, the x label and the y label, and then we say: show me the visualization. If you want to learn more about these visualization techniques, make sure to check the Python for Data Science course, because that one will help you understand slowly and in detail how to visualize your data. So what we are visualizing here is the frequency of the median house values in the entire dataset. This means we are looking at the number of times each median house value appears in the dataset. We want to understand whether there are certain median house values that appear very often, and certain house values that do not appear that often, which could perhaps be considered outliers. We want to keep only the most relevant and representative data points in our data: we want to derive conclusions that hold for the majority of our observations and not for outliers. We will then use that representative data to run our linear regression and draw conclusions.
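A sketch of the plotting cell just described, assuming the cleaned DataFrame is called `data`, as in the earlier steps:

```python
sns.set_style("whitegrid")        # white background with a grid
plt.figure(figsize=(10, 6))       # figure size 10 by 6

sns.histplot(data["median_house_value"], color="forestgreen")

plt.title("Distribution of Median House Values")
plt.xlabel("Median House Value")
plt.ylabel("Frequency")
plt.show()
```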
When looking at this graph, what we can see is that there is a cluster of median house values that appear quite often; those are the cases where the frequency is high. You can see, for instance, that houses in this whole range appear very often: a median house value of about 160 to 170K appears very frequently, with a frequency above 1,000. Those are the most frequently appearing median house values. And then there are cases, which you can see here and here, of houses whose median house value does not appear very often; their frequency is low. Roughly speaking, those are unusual houses and can be considered outliers. The same also holds for these houses, because for those the frequency is very low, which means that in our population of houses, so Californian house prices, you will most likely see blocks of houses whose median value is between, let's say, 70K and, let's say, 300 or 350K. Anything below or above this range is unusual: you don't often see house blocks with a median house value less than 70 or 60K, or above 370 or 400K. Do consider that we are dealing with data from 1990 and not with current prices, because nowadays Californian houses are much more expensive; take that into account when interpreting this type of data visualization. What we can then do is use the idea of the interquartile range to remove these outliers. What this basically means is that we look at the lowest 25 percent, so the first quartile, the 25th percentile, and at the upper 25 percent, so the third quartile or 75th percentile. By using the 25th and 75th percentiles, the first and third quartiles, we can identify the observations, the blocks, whose median house value lies far below the 25th percentile or far above the 75th percentile, and remove them. So basically we want to keep the middle part of our data: the blocks whose median house value is neither extremely small nor extremely large, so the so-called normal and representative blocks. What we are using is the statistical concept called the interquartile range. You don't need to memorize the name, but I think it is worth understanding, because this is a very popular way of making a data-driven removal of the outliers. So I select the 25th percentile by using the quantile function from pandas. I'm saying: find for me the value that divides my block observations into those whose median house value is below the 25th percentile and those above it; so what are the largest 75 percent and what are the smallest 25 percent when it comes to the median house value. That I will do by using this Q1, and then we will use Q3 in order to deal with the very large median house values, so the upper 25 percent; the sketch right after this paragraph shows the whole filtering step.
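A sketch of that IQR-based filtering on the median house value; the exact numbers it prints are walked through in the next paragraph:

```python
Q1 = data["median_house_value"].quantile(0.25)   # first quartile (25th percentile)
Q3 = data["median_house_value"].quantile(0.75)   # third quartile (75th percentile)
IQR = Q3 - Q1                                    # interquartile range

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(Q1, Q3, IQR)

# Keep only the "normal", representative blocks
data = data[(data["median_house_value"] >= lower_bound) &
            (data["median_house_value"] <= upper_bound)]
print(len(data))   # number of remaining observations after removing outliers
```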
And then, in order to calculate the interquartile range, we take Q3 and subtract Q1 from it. Just to understand this idea of Q1 and Q3, so the quartiles, a bit better, let's actually print Q1 and Q3: let's remove the rest of that code for now and run just this part. As you can see, what we find is that Q1, so the 25th percentile or first quartile, is equal to $119,500. What it means is that the smallest 25% of the observations have a median house value below $119,500, and the remaining 75% of our observations have a median house value above $119,500. Then Q3, the third quartile or the 75th percentile, describes the threshold, the value, that separates the lowest 75 percent of median house values from the most expensive, highest median house values, so the upper 25 percent. We see that this threshold is $264,700, somewhere around here, which means that the blocks of houses with the highest valuations, the top 25 percent of median house values, are above $264,700. That's something we want to trim: we want to remove the observations with extremely small and extremely large median house values. And it is common practice, when it comes to the interquartile range approach, to multiply the interquartile range by 1.5 in order to obtain the lower bound and the upper bound, so the thresholds that we use to remove the blocks, the observations, where the median house value is very small or very large. For that, I multiply the IQR, the interquartile range, by 1.5: when we subtract this value from Q1 we get our lower bound, and when we add it to Q3 we get our upper bound. And we will see that after we clean these outliers from our data, we end up with a smaller dataset: previously we had around 20K, so 20,433 observations, and now we have 19,369 observations, so we have removed roughly a thousand, a bit over a thousand, observations from our data. Next, let's look into some other variables, for instance the median income. Another technique that we can use to identify outliers in the data is the box plot. I wanted to showcase these different approaches for visualizing the data and identifying outliers, such that you become familiar with different techniques. So let's go ahead and plot the box plot. A box plot is a statistical way to represent your data. The central box represents the interquartile range, so the IQR, and the bottom and the top edges indicate the 25th percentile, so the first quartile, and the 75th percentile, so the third quartile, respectively. The length of the box that you see here, this dark part, covers basically the middle 50% of your data for the median income. And the median line inside the box, the one with the contrasting color, represents the median of the dataset; the median is the middle value when the data is sorted in ascending order.
Then we have the whiskers in our box plot: these lines extend from the top and the bottom of the box and indicate the range for the rest of the dataset, excluding the outliers. They typically reach 1.5 times the IQR above Q3 and 1.5 times the IQR below Q1, the same rule we just saw when removing the outliers from the median house value. So in order to identify the outliers, you can quickly see that we have all these points lying more than 1.5 times the IQR above the third quartile, the 75th percentile. That's something you can also see here, and it means that those are blocks of houses with an unusually high median income. That's something we want to remove from our data, and therefore we can use exactly the same approach that we used previously for the median house value. We will identify the 25th percentile or first quartile, Q1, and then Q3, the third quartile or 75th percentile. Then we compute the IQR, and then we obtain the lower bound and the upper bound using this 1.5 as a scale. We then use the lower bound and the upper bound as filters to remove from the data all the observations where the median income falls below the lower bound or above the upper bound. So we are using the lower bound and the upper bound to perform double filtering: we use two filters in the same row, as you can see, and we use parentheses and the & operator to tell Python: first check that this condition is satisfied, so the observation has a median income above the lower bound, and at the same time it should hold that the observation, so the block, has a median income below the upper bound. If a block, an observation in the data, satisfies both of these criteria, then we are dealing with a good, normal point and we keep it, and we say that this is our new data. So let's go ahead and execute this code. In this case, all our outliers lie on the high side, in the upper part of the box plot, so it is really the upper bound doing the work. We then end up with the clean data; I take this clean data and put it back under data, just for simplicity. This data is now much cleaner, and it's a better representation of the population, which is what we ideally want. Because we want to find out the features that describe and define the house value not based on unique and rare houses, which are too expensive or which sit in blocks with very high-income people, but rather based on the true representation, the most frequently appearing data: what are the features that define the house values for common houses and common areas, for people with average or normal income? That's what we want to find.
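A sketch of the box plot and of the same IQR filter applied to the median income:

```python
plt.figure(figsize=(10, 4))
sns.boxplot(x=data["median_income"])   # box = IQR, line = median, points beyond the whiskers = outliers
plt.title("Median Income")
plt.show()

Q1 = data["median_income"].quantile(0.25)
Q3 = data["median_income"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Double filter: keep only blocks whose median income lies between the two bounds
data = data[(data["median_income"] >= lower_bound) &
            (data["median_income"] <= upper_bound)]
```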
The next thing that I tend to do, especially when it comes to regression analysis and causal analysis, is to plot the correlation heat map. This means we compute the correlation matrix, so the pairwise correlation score for each pair of variables in our data. When it comes to linear regression, one of the assumptions we learned during the theory part is that we should not have perfect multicollinearity, which means there should not be a high correlation between pairs of independent variables: knowing one should not automatically tell us the value of the other independent variable. If the correlation between two independent variables is very high, it means we might potentially be dealing with multicollinearity, and that's something we want to avoid. A heat map is a great way to see whether we have this type of problematic independent variables, and whether we need to drop one or more of them, to ensure that we have a proper linear regression model whose assumptions are satisfied. Now we will look at this correlation heat map, and here we use Seaborn to plot it. As you can see, the colors range from very light, so white, to very dark green, where light means a strong negative correlation and very dark green means a very strong positive correlation. We know that the correlation value, the Pearson correlation, can take values between minus one and one: minus one means a very strong negative correlation, one means a very strong positive correlation. And when we look at the correlation of a variable with itself, so the correlation between longitude and longitude, this correlation is equal to one. That's why on the diagonal we have all ones: those are the pairwise correlations of the variables with themselves. And all the values under the diagonal are the mirror of the values above the diagonal, because the correlation between the same two variables is the same regardless of which one we put first: the correlation between longitude and latitude and the correlation between latitude and longitude are the same. Now that we have refreshed our memory on this, let's look into the actual numbers in the heat map. As we can see, there is a section where independent variables have a low positive correlation with the remaining independent variables: you can see these light green values, which indicate a low positive relationship between pairs of variables. One thing that is very interesting is the middle part of the heat map, where we have these dark cells. The numbers below the diagonal are what we can interpret, and remember that below the diagonal and above the diagonal are mirrors of each other. Here we already see a problem, because we are dealing with variables, which are going to be independent variables in our model, that have a high correlation. Why is this a problem? Because one of the assumptions of linear regression, like we saw during the theory section, is that we should not have multicollinearity, so the multicollinearity problem. When we have perfect multicollinearity, it means we are dealing with independent variables that are so highly correlated that knowing the value of one lets us know, automatically, the value of the other. And when we have a correlation of 0.93, which is very high, or 0.98, it means that those two independent variables have a super high positive relationship. This is a problem, because it might cause our model to produce very large standard errors, and a model that is neither accurate nor generalizable. That's something we want to avoid, and we want to ensure that the assumptions of our model are satisfied.
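A sketch of the heat map cell, computed only on the numeric columns, since ocean_proximity is still a string at this point:

```python
plt.figure(figsize=(10, 8))
corr = data.select_dtypes(include=[np.number]).corr()    # pairwise Pearson correlations
sns.heatmap(corr, annot=True, fmt=".2f", cmap="Greens")   # annotated heat map in green tones
plt.title("Correlation Heatmap")
plt.show()
```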
Now, we are dealing with the independent variables total_bedrooms and households: the total number of bedrooms per block and the number of households are highly, positively correlated, and this is a problem. Ideally we want to drop one of these two independent variables, and we can do that because, being highly correlated, they already explain a similar type of information; they contain a similar type of variation, so including both simply doesn't make sense. On the one hand it potentially violates the model assumptions, and on the other hand it doesn't add much value, because the other variable already captures similar variation. total_bedrooms basically contains the same type of information as households, so we are better off dropping one of the two. The question is which one, and that is something we can decide by also looking at the other correlations here. total_bedrooms has a high correlation with households, but total_rooms also has a very high correlation with households, so there is yet another independent variable that is highly correlated with households, and total_rooms is also highly correlated with total_bedrooms. So we can check which variable is most often highly correlated with the rest of the independent variables, and in this case the two largest numbers both involve total_bedrooms: it has a correlation of 0.93 with total_rooms and a very high correlation of 0.98 with households. This means total_bedrooms has the highest correlation with the remaining independent variables, so we might as well drop it. But before doing that, I would suggest one more quick visual check: look at the correlation of total_bedrooms with the dependent variable to see how strong its relationship with the response variable is. We see that total_bedrooms has a correlation of only about 0.05 with the response variable, the median house value, whereas total_rooms has a much higher one. So I already feel comfortable excluding and dropping total_bedrooms from our data to make sure we are not dealing with perfect multicollinearity, and that is exactly what I do here: I drop total_bedrooms, and after doing that we no longer have it as a column. Before moving on to the actual causal analysis, there is one more step I want to show you, which is super important for causal analysis and some introductory econometrics. When you have a string categorical variable, there are a few ways to deal with it. One easy way you will see on the web is label encoding, which simply transforms the string values, so NEAR BAY, <1H OCEAN, INLAND, NEAR OCEAN, ISLAND, into numbers, such that the ocean proximity variable takes values like one, two, three, four, five. One way of doing that could look something like this.
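Purely as an illustration of that label-encoding idea, and not the approach we will actually use, a hedged sketch could look like the following (the category spellings are assumed to match the ocean_proximity values in this data set):

```python
# Illustrative label-encoding sketch only; we will use dummy variables instead.
# Assumes `data` has an ocean_proximity column with these exact category names.
encoding_map = {"NEAR BAY": 1, "<1H OCEAN": 2, "INLAND": 3, "NEAR OCEAN": 4, "ISLAND": 5}
data["ocean_proximity_encoded"] = data["ocean_proximity"].map(encoding_map)
```

The drawback of this is that it imposes an artificial ordering on the categories, which is one more reason to prefer the dummy variables described next.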
A better way to use this type of variable in linear regression is to transform the string categorical variable into what we call dummy variables. A dummy variable takes two possible values; it is usually a binary Boolean variable that can be zero or one, where one means the condition is satisfied and zero means it is not. Let me give you an example. In this specific case, ocean proximity is a single variable with five different values. We use the get_dummies function from pandas to go from this one variable to five different variables, one per category: new variables indicating whether the block is near the bay or not, whether it is less than one hour from the ocean, whether it is inland, whether it is near the ocean, or whether it is on an island. Each of these is a separate binary dummy variable taking the values zero and one, so we go from one string categorical variable to five different dummy variables, one for each of the five categories. We then combine them with the rest of the data and drop the original ocean proximity column. On the one hand, we get rid of this string variable, which is problematic for linear regression when combined with the scikit-learn library, because scikit-learn cannot handle this type of data for linear regression. On the other hand, we make our job easier when interpreting the results: interpreting linear regression for causal analysis is much easier with dummy variables than with a single string categorical variable. To give an example using the five dummies you can see here: if we look at one category, say ocean_proximity_INLAND, then for all the rows where the value is zero, the criterion is not satisfied, which means the block of houses we are dealing with is not inland; and wherever the value is one, the criterion is satisfied, and we are dealing with blocks of houses that are indeed inland. One thing to keep in mind when transforming a string categorical variable into a set of dummies is that you always need to drop at least one of the categories. The reason comes from the theory: we should have no perfect multicollinearity, and we cannot include five dummy variables that together are perfectly collinear. If we include all of them, then whenever we know that a block of houses is not near the bay, not less than one hour from the ocean, not inland, and not near the ocean, we automatically know that it must belong to the remaining category, which is ISLAND, so for all those blocks ocean_proximity_ISLAND will be equal to one. That is exactly the definition of perfect multicollinearity, and it is something we want to avoid.
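As a sketch of that dummy-creation step, assuming the column is named ocean_proximity, the code could look like this:

```python
import pandas as pd

# Sketch (assumes `data` still contains the original ocean_proximity string column).
dummies = pd.get_dummies(data["ocean_proximity"], dtype=int)  # one 0/1 column per category

# Attach the five dummy columns and drop the original string column.
data = pd.concat([data.drop("ocean_proximity", axis=1), dummies], axis=1)
print(data.head())
```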
So, to avoid violating one of the OLS assumptions, we need to drop one of those categories, and that is exactly what I am doing here. Let's first see the full set of dummy variables we got: less than one hour from the ocean, inland, island, near bay, and near ocean. Let's drop one of them, say ISLAND. We can do that very simply by adding a code cell here: data = data.drop with the name of the variable in quotation marks and axis=1. In this way I drop one of the dummy variables I created, so that the no-perfect-multicollinearity assumption is not violated. Once I print the columns, we should see that column disappear. Here we go, we successfully deleted that variable. Let's also print the head: you can see that we no longer have a string in our data, but instead we have four additional binary variables built out of a string categorical variable with five categories. All right, now we are ready to do the actual work. When training a machine learning or statistical model, we learned during the theory that we always need to split the data into a train and a test set; that is the minimum. In some cases we also do a train-validation-test split, so that we can train the model on the training data, optimize it on the validation data to find the optimal set of hyperparameters, and then apply the fitted and optimized model to unseen test data. We are going to skip the validation set for simplicity, especially since we are dealing with a very simple model such as linear regression, and split our data into train and test only. First, I create a list with the names of the variables we are going to use to train the model: a set of independent variables and the dependent variable. For our multiple linear regression, the independent variables are longitude, latitude, housing median age, total rooms, population, households, median income, and the four dummy variables we built from the categorical variable. Then I specify that the target, so the response variable or dependent variable, is the median house value. That is the variable we want to target, because we want to see which features, which independent variables out of this whole set, have a statistically significant impact on the median house value; we want to find out which characteristics of the houses in a block cause a change, a variation, in the target. So we have X, where from the data we take all the features with those names, and then the target, the median house value, which is the column we select from the data. So we are doing data filtering.
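A minimal sketch of dropping the reference category and selecting the features and target could look like the following; the column names are assumptions that should be adjusted to match your own data:

```python
# Sketch (column names are assumed; adjust them to your own data).
data = data.drop("ISLAND", axis=1)   # drop the reference dummy category
print(data.columns)

feature_names = ["longitude", "latitude", "housing_median_age", "total_rooms",
                 "population", "households", "median_income",
                 "<1H OCEAN", "INLAND", "NEAR BAY", "NEAR OCEAN"]
target_name = "median_house_value"

X = data[feature_names]   # independent variables
y = data[target_name]     # dependent (response) variable
```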
Here I am using the train_test_split function from scikit-learn, which is why, at the beginning, we imported it from sklearn.model_selection. This is a function you will need a lot in machine learning, because it is a very easy way to split your data. The first argument of this function is the matrix or data frame containing the independent variables, in our case X, and the second argument is the dependent variable, y. Then we have test_size, which is the proportion of observations you want to put in the test set, and implicitly the proportion you do not put in the training set. If you pass 0.2, it means you want your test set to be 20% of your entire data and the remaining 80% to be your training data; the function automatically understands that you want this 80-20 division. Finally, you can also set the random state, because the splitting is random, so the observations are randomly sampled from the entire data set. To ensure that your results are reproducible, that you get the same results the next time you run this notebook, and that you and I get the same results, we use a random state; the random state of 111 is just a number I liked and decided to use here. When we run this command, you can see that the training set has about 15K observations and the test set about 3.8K, so looking at these numbers you get a verification that you are indeed dealing with an 80% versus 20% split.
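A sketch of that split, using the 80-20 proportion and the random state mentioned above, could be:

```python
from sklearn.model_selection import train_test_split

# Sketch: 80/20 split with a fixed random state for reproducibility
# (assumes X and y were defined in the previous step).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111)

print(X_train.shape, X_test.shape)   # roughly 80% versus 20% of the rows
```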
Then we move on to the training. One thing to keep in mind is that here we are using the sm module that we imported from statsmodels.api; this is one library we can use to conduct our causal analysis and train a linear regression model. This library does not automatically add a first column of ones to your set of independent variables: it only looks at the features you have provided, which are all the independent variables. But we learned from the theory that in linear regression we always add the intercept, the beta zero; if you go back to the theory lectures, you will see this beta zero added both to the simple and to the multiple linear regression. The intercept tells us the average median house value, in this case, when all the other features are equal to zero. Given that statsmodels.api does not add this constant column for the intercept, we need to add it manually, so we call sm.add_constant on X_train, which means our X data frame now gets a column of ones in front of the features. Let me show you this before doing the training, because it is something you should be aware of. I create X_train_const and also print the same feature data frame before adding the constant, so that you can see what I mean. As you can see, the original is just the set of columns that form the independent variables, the features, and once we add the constant, there is an initial column of ones. This is done so that we get a beta zero at the end, the intercept, and can perform a valid multiple linear regression; otherwise you have no intercept, and that is simply not what you are looking for. The scikit-learn library does this automatically, so when you use statsmodels.api you should add this constant, and when you use scikit-learn you should not. If you are wondering why we use this specific library, as we already discussed, just to refresh your memory: we use statsmodels.api because it has the nice property of printing a summary of your results, your p-values, your t-tests, your standard errors, which is exactly what you are looking for when performing a proper causal analysis and identifying the features that have a statistically significant impact on your dependent variable. If you are using a machine learning model, including linear regression, only for predictive analytics, you can use scikit-learn without worrying about statsmodels.api. So much for adding a constant; now we are ready to fit, or train, our model. What we need to do is use sm.OLS, where OLS is the ordinary least squares estimation technique we also discussed in the theory part. We first provide the dependent variable, y_train, and then the feature set, X_train_const, and then we call .fit(), which means: take the OLS model, use y_train as my dependent variable and X_train_const as my independent variable set, and fit the OLS algorithm, the linear regression, on this specific data. If you are wondering why y_train or X_train, and what the difference between train and test is, make sure to revisit the training theory lectures, because there I go into the concept of training and testing and how we divide the data. And this y and x, as we have already discussed during this tutorial, is simply the distinction between the independent variables, denoted by x, and the dependent variable, denoted by y. So y_train and y_test are the dependent variable for the training and test data, and X_train and X_test are the training features and the test features. We use X_train and y_train to fit our model, to learn from the data. Then, when it comes to evaluating the model, we take the fitted model, which has learned from both the dependent and the independent variables, y_train and X_train, apply it to unseen data, X_test, obtain the predictions, and compare them to the true y, y_test, to see how different y_test is from the predictions for this unseen data and to evaluate how well the model manages to identify and predict the median house values for data it has not seen.
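Putting that fitting step together, a minimal sketch, assuming statsmodels is imported as sm and X_train, y_train come from the split above, could be:

```python
import statsmodels.api as sm

# Sketch (assumes X_train and y_train come from the train-test split above).
X_train_const = sm.add_constant(X_train)   # prepend a column of ones for the intercept

model_fitted = sm.OLS(y_train, X_train_const).fit()   # ordinary least squares fit
print(model_fitted.summary())   # p-values, t-tests, standard errors, R-squared
```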
That was just some background information and a refresher. In this case, we fit the model on the training dependent variable and the training independent variables plus a constant, and then we are ready to print the summary. Let's now interpret those results. The first thing we can see is that all the coefficients, all the independent variables, are statistically significant. How can I see this? If we look here, we can see the column of p-values, and this is the first thing you need to look at when you get the results of a causal analysis with linear regression. Here we see that the p-values are very small. Just to refresh our memory: the p-value is the probability of obtaining a test statistic at least as extreme as the one you observed purely by random chance, that is, assuming the null hypothesis is true; a very small p-value means it is unlikely that you are seeing a statistically significant result just by chance, so you reject the null hypothesis. That is one thing. In this summary you get much more. The first thing you can do is verify that you have used the correct dependent variable: you can see here that the dependent variable is the median house value. The model used to estimate the coefficients is OLS, and the method is least squares, which is simply the underlying approach of minimizing the sum of squared residuals. The date we are running this analysis is the 26th of January 2024. We also have the number of observations, which is the number of training observations, so 80% of our original data. Then we have the R-squared, the metric that shows the goodness of fit of your model. R-squared is commonly used specifically in linear regression to measure how well your model is able to fit your data with the regression line; its maximum is one and its minimum is zero. A value of 0.58, approximately 0.59 in this case, means that the independent variables you have included are able to explain 59% of the entire variation in your response variable, the median house value. What does this mean? On the one hand, you have a reasonable amount of information, since anything above 0.5 is quite good, meaning you can explain more than half of the variation in the median house value. On the other hand, it also means there is roughly 40% of the variation, information about the house values, that your data does not capture. So you might consider looking for additional information, additional independent variables, to add on top of the existing ones in order to increase the amount of variation you can explain with your model. The R-squared is the best single way to describe the quality of your regression model. Another thing we have is the adjusted R-squared.
The adjusted R-squared and the R-squared are the same in this specific case, 0.59, which usually means you are fine in terms of the number of features you are using. Once you overwhelm your model with too many features, you will notice that the adjusted R-squared starts to differ from the R-squared. The adjusted R-squared helps you understand whether your model performs well only because you keep adding variables, or because those variables really contain useful information; the plain R-squared will automatically increase just because you add more independent variables, even when they are not useful and only add complexity, possibly overfitting your model without providing any added information. Next we have the F-statistic, which corresponds to the F-test. The F-test comes from statistics; you don't strictly need to know it, but I would say check out the Fundamentals of Statistics course if you want to, because it tests whether all these independent variables jointly help to explain your dependent variable, the median house value. If the F-statistic is very large, or the p-value of the F-statistic is very small, like 0.00, it means your independent variables are jointly statistically significant: together they help explain the median house value and have a statistically significant impact on it, which means you have a good set of independent variables. Then we have the log-likelihood, not super relevant in this case, and the AIC and BIC, which stand for the Akaike Information Criterion and the Bayesian Information Criterion. Those are not necessary to know for now, but once you advance in your machine learning career, it may be useful to understand them at a high level. For now, think of them as values that summarize the information you gain by adding this set of independent variables to your model; this is optional, so ignore it if you don't know it yet. Okay, now let's get to the fun part. In the middle part of the summary table we have the set of independent variables: our constant, which is the intercept, then longitude, latitude, housing median age, total rooms, population, households, median income, and the four dummy variables we created. Then we have the coefficients corresponding to those independent variables. These are the beta zero hat, beta one hat, beta two hat, and so on: the parameters of the linear regression model that our OLS method has estimated from the data we provided. Before interpreting the independent variables, the first thing to do, as I mentioned in the beginning, is to look at the p-value column, which shows which independent variables are statistically significant. The table you get from statsmodels.api is reported at a 5% significance level, so the alpha, the threshold of statistical significance, is 5%, and any p-value smaller than 0.05 means you are dealing with a statistically significant independent variable. The next thing you can see, to the left, is the t-statistic; the p-value is based on a t-test.
This t-test, as we learned during the theory, and as you can also check in the Fundamentals of Statistics course from LunarTech for a more detailed understanding, tests whether each of these independent variables individually has a statistically significant impact on the dependent variable. Whenever the t-test has a p-value smaller than 0.05, you are dealing with a statistically significant independent variable. In this case we are very lucky: all our independent variables are statistically significant. The next question is whether the effect is positive or negative, which you can see from the signs of the coefficients. Longitude has a negative coefficient, latitude has a negative coefficient, housing median age has a positive coefficient, and so on. A negative coefficient means that the independent variable causes a negative change in the dependent variable. More specifically, let's look at, say, total_rooms. Its coefficient is minus 2.67, which means that if we increase the total number of rooms by one additional unit, one more room added to total_rooms, then the house value decreases by 2.67. Now you might be wondering how this is possible. First of all, the coefficient is quite small, so the relationship is not very strong; the magnitude of this coefficient is small. But on the other hand you can explain it: at some point, adding more rooms just doesn't add any value, and in some cases it even decreases the value of the house. That might be the case; at least it is the case based on this data. So when there is a negative coefficient, a one-unit increase in that specific independent variable, everything else constant, results in a decrease in the dependent variable; in the case of total_rooms, a 2.67 decrease in the median house value, everything else constant. In econometrics we also refer to this as ceteris paribus, which means everything else held constant. One more time, to make sure we are clear on this: if we add one more room to the total number of rooms, then the median house value decreases by $2.67, provided the longitude, latitude, housing median age, population, households, median income, and all the other characteristics stay the same. So a negative coefficient means we get a decrease in the median house value when there is a one-unit increase in, for instance, the total number of rooms. Now let's look at the opposite case, where the coefficient is positive and large, which is the housing median age. This means that if we have two houses with exactly the same characteristics, the same longitude and latitude, the same total number of rooms, population, households, and median income, and the same distance from the ocean, then if one of these houses has one additional year added to the housing median age, so it is one year older, its median house value is higher by $846.
So this house, which has one additional year of median age, has an $846 higher median house value compared to the one that has all the same characteristics except a housing median age that is one year lower. One additional year in the median age results in an $846 increase in the median house value, everything else constant. That covers the idea of negative and positive coefficients and their magnitude. Now let's look at one dummy variable and explain how it can be interpreted in the context of linear regression. One of the independent variables is ocean proximity inland, and its coefficient is equal to -2.108e+05, which is approximately minus 210K. What this means is that if we have two blocks of houses with exactly the same characteristics, the same longitude and latitude, the same housing median age, the same total number of rooms, population, households, and median income, with the single difference that one block is located inland and the other is not, then we compare against the reference category, which is the category we removed, the island category. So if a block of houses is inland, its median house value is on average about 210K lower than that of a block with exactly the same characteristics that is not inland, for instance one that is on an island. When it comes to these dummy variables, where there is an underlying reference category that you deleted from your string categorical variable, you need to interpret each dummy relative to that specific category. This might sound complex, but it really is not; it is just a matter of practice and of understanding the idea of a dummy variable: a block either satisfies the criterion or it does not. In this specific case, if you have two blocks of houses with exactly the same characteristics and one block is inland while the other is not, for instance it is on an island, then the inland block will on average have a 210,000 lower median house value than the island block, which kind of makes sense, because in California people might prefer living in island locations, and houses there might be in higher demand than in inland locations. So the longitude has a statistically significant impact on the median house value, and so does the latitude. The housing median age causes a statistically significant difference in the median house value when the median age changes. The total number of rooms has an impact on the median house value, and so do the population, the households, the median income, and the proximity to the ocean. This is because all their p-values are zero, which means they are smaller than 0.05, and therefore they all have a statistically significant impact on the median house value in the Californian housing market.
When it comes to interpreting all of them, we have interpreted just a few for the sake of simplicity and to keep this case study from taking too long, but what I would suggest you do is interpret all of the coefficients here. We have interpreted the housing median age and the total number of rooms, but you can also interpret the population as well as the median income, and we have interpreted one of the dummy variables, so feel free to interpret the other ones too. By doing this you could even build an entire case study paper, explaining in one or two pages the results you obtained, which would showcase that you understand how to interpret linear regression results. Another thing I would suggest is to comment on the standard errors, so let's look into them now. We can see that we are making a huge standard error, and this is a direct result of the fourth assumption being violated. This case study is important and useful precisely because it shows what happens when some of your assumptions are satisfied and some are violated. In this specific case, the assumption that the errors have a constant variance is violated, so we have a heteroscedasticity issue, and that is something we see reflected in our results. It is a very good example of how, even without formally checking the assumptions, you can already see that the standard errors are very large, which already hints that heteroscedasticity is most likely present and the homoscedasticity assumption is violated. Keep this idea of large standard errors in mind, because we will see that it also becomes a problem for the performance of the model, and we will see that we obtain a large error because of it. One more comment on the total rooms and the housing median age: in some cases, linear regression results might not seem logical, but sometimes there actually is an underlying explanation, or maybe your model is simply overfitting or biased, which is also possible; that is something you can investigate by checking your OLS assumptions. Before going to that stage, I want to briefly show you the idea of predictions. We have now fitted our model on the training data, and we are ready to perform predictions. We can use our fitted model and the test data, X_test, to predict new median house values for the blocks of houses for which we are not providing the corresponding median house price. So, on unseen data, we apply the model we have already fitted and look at the predicted median house values. We can then compare these predictions to the true median house values, which we have but are not yet exposing to the model, and see how good a job the model does at estimating these unknown median house values for the test data, that is, for all the blocks of houses for which we have provided the characteristics in X_test but not the y_test.
As usual, just like in training, we add a constant with this library, and then we take the fitted model and call its predict method on the test data; those are the test predictions. Once we do this, we can print the test predictions and see a list of house values: the predicted values for the blocks of houses included in the testing data, so the 20% of our entire data set. As I mentioned just before, to make sure your model performs well, you need to check the OLS assumptions. During the theory section we learned that there are several assumptions your model and your data should satisfy for OLS to provide unbiased and efficient estimates, which means the estimates are accurate and their standard errors are low, something we also see in the summary results. The standard error is a measure of how efficient your estimates are: can the coefficients shown in this table vary a lot, meaning the estimation is imprecise, the range is very large, and the standard error will be large, which is a bad sign, or are you dealing with an accurate, more precise estimation, in which case the standard error will be low? An unbiased estimate means your estimates are a true representation of the pattern between each independent variable and the response variable. If you want to learn more about bias, unbiasedness, and efficiency, make sure to check the Fundamentals of Statistics course at LunarTech, because it explains these concepts clearly and in detail; here I assume you know them, at least at a high level. Let's quickly check the OLS assumptions. The first assumption is the linearity assumption, which means that your model is linear in parameters. One way of checking it is to use your fitted model and compare y_test, the true median house values for your test data, with the test predictions, the predicted median house values for this unseen data. You plot the observed values against the predicted values, add the perfectly fitted line you would get in an ideal situation where the model makes no error and returns the exact true values, and then see how linear the relationship is. If the observed-versus-predicted pattern is roughly linear and matches this perfect line, then assumption one is satisfied: your linearity assumption holds, and you can say that your data and your model are indeed linear in parameters.
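A sketch of the prediction step and of that observed-versus-predicted check, reusing the fitted model from above, could be:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Sketch (assumes model_fitted, X_test, y_test exist from the previous steps).
X_test_const = sm.add_constant(X_test)            # same constant column as in training
test_predictions = model_fitted.predict(X_test_const)

# Observed vs predicted median house values; a cloud that roughly follows the
# 45-degree line suggests the linearity assumption is reasonable.
plt.scatter(y_test, test_predictions, alpha=0.3)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="red")
plt.xlabel("Observed median house value")
plt.ylabel("Predicted median house value")
plt.show()
```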
Then we have the second assumption, which states that your sample should be random; this basically translates into the expectation of your error terms being equal to 0. One way of checking it is simply to take the residuals from your fitted model and compute their average, which is a good estimate of the expectation of the errors. This gives the mean of the residuals, where the residuals are the estimates of the true error terms, and here I simply round it to two decimal places. If this number is equal to 0, which is the case, so the mean of the residuals in our model is 0, it means that the expectation of the error terms, or at least its estimate, the expectation of the residuals, is indeed equal to 0. Another way of checking the second assumption, that the model is based on a random sample and therefore the expectation of the error terms is 0, is to plot the residuals versus the fitted values. We take the residuals from the fitted model, compare them to the fitted values of the model, and look at the scatter plot you can see here to check whether the pattern is symmetric around the zero line. You can see this line running right through the middle of the pattern, which means that on average the residuals are centered around 0, so the mean of the residuals is 0, exactly what we calculated before. Therefore we can say that we are indeed dealing with a random sample. This plot is also super useful for the fourth assumption, which comes a bit later. For now, let's check the third assumption, the assumption of exogeneity. Exogeneity means that each of our independent variables should be uncorrelated with the error terms: there is no omitted variable bias and no reverse causality, so the independent variable has an impact on the dependent variable, but not the other way around; the dependent variable should not cause the independent variable. There are a few ways to check this. One straightforward way is to compute the correlation coefficient between each independent variable and the residuals obtained from your fitted model. This is a simple technique you can use very quickly to see the correlation between each independent variable and the residuals, which are the best estimates of your error terms, and in this way to understand whether your independent variables are correlated with the error terms. Another way, which is more advanced and leans more towards econometrics, is to use the Durbin-Wu-Hausman test. This is a more formal, more professional econometric test to find out whether the exogeneity assumption is satisfied or whether you have endogeneity, which means that one or more of your independent variables is potentially correlated with the error terms. I won't go into the details of this test; I'll put some explanation here, and feel free to check any introductory econometrics course to learn more about the Durbin-Wu-Hausman test for the exogeneity assumption.
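A sketch of those residual-based checks, the mean of the residuals, the residuals-versus-fitted plot, and the simple per-feature correlation check, could look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

# Sketch (assumes model_fitted and X_train exist from the previous steps).
residuals = model_fitted.resid
print("Mean of residuals:", round(residuals.mean(), 2))   # should be close to 0

# Residuals vs fitted values: the cloud should be roughly symmetric around 0;
# a funnel shape would hint at heteroscedasticity.
plt.scatter(model_fitted.fittedvalues, residuals, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Simple exogeneity check: correlation of each feature with the residuals.
for col in X_train.columns:
    print(col, round(np.corrcoef(X_train[col], residuals)[0, 1], 3))
```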
The fourth assumption we will talk about is homoscedasticity. The homoscedasticity assumption states that the error terms should have a constant variance: when we look at the variation the model makes across different observations, that variation should be roughly constant. In our case, looking at the residuals in the figure, the errors are small for some observations and much larger for others, so the spread is clearly not constant; this is what we call heteroscedasticity, which means the homoscedasticity assumption is violated: our error terms do not have a constant variance across all observations, and we see high and differing variation for different observations. Since we have a heteroscedasticity issue, we should consider somewhat more flexible approaches like GLS, FGLS, or GMM, all slightly more advanced econometric estimators. The final part of this case study shows how you can do all of this from the traditional machine learning side, using scikit-learn. Here I use the StandardScaler function to scale my data, because we saw in the summary table from statsmodels.api that our data is on a very large scale: the median house values are large numbers, the median age of the houses is on a different scale, and so on. That is something you want to avoid when you use linear regression as a predictive analytics model. When you use it for interpretation purposes, you should keep the original scales, because it is easier to interpret the values and to understand the difference in the median house price when comparing different characteristics of the blocks of houses. But when you use it for predictive analytics, where you really care about the accuracy of your predictions, you need to scale your data and make sure it is standardized. One way of doing that is with the StandardScaler from sklearn.preprocessing. I initialize the scaler by calling StandardScaler(), which I just imported from the scikit-learn library, and then I call fit_transform on X_train, which takes the independent variables and scales and standardizes the data. Standardization simply means rescaling the data so that some large values do not wrongly influence the predictive power of the model: the model is not confused by the large numbers and does not pick up spurious variation, but instead focuses on the true variation in the data, on how much a change in one independent variable causes a change in the dependent variable. Given that we are dealing with a supervised learning algorithm, X_train_scaled will contain our standardized training features, the independent variables, and X_test_scaled will contain our standardized test features, the unseen data that the model will not see during training, only during prediction. We will also use y_train, which is the dependent variable in our supervised model and corresponds to the training data.
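A sketch of that scaling step, assuming the train and test feature sets from before, could be:

```python
from sklearn.preprocessing import StandardScaler

# Sketch (assumes X_train and X_test come from the earlier split).
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on the training features only
X_test_scaled = scaler.transform(X_test)         # reuse the same scaling for the test features
```

Fitting the scaler on the training features only and reusing it on the test features is the usual convention, since it avoids leaking information from the test set into training.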
Then we initialize the linear regression, the LinearRegression model from scikit-learn; at this point it is just an empty linear regression model. We take this initialized model and fit it on the training data, so X_train_scaled, the scaled training features, and the dependent variable from the training data, y_train. Note that I am not scaling the dependent variable; that is common practice, because you don't want to standardize your dependent variable. Rather, you want to make sure your features are standardized, because what you care about is the variation in your features and making sure the model doesn't get confused when learning from them, not rescaling the outcome whose changes you are studying. So I fit the model on this training data, the features and the dependent variable, and then I use this fitted model, the LR, which has already learned from these features and the dependent variable during supervised training, together with X_test_scaled, the standardized test data, to perform the prediction: to predict the median house values for the test data, the unseen data. You can notice that nowhere here am I using y_test. I keep y_test, the true values of the dependent variable, to myself, so that I can then compare the predicted values against it and see how well my model was actually able to predict. Now let's do one more step. I import the metrics from scikit-learn, such as the mean squared error, and I use the mean squared error to find out how well my model was able to predict those house prices. Expressed as a root mean squared error, this means that on average we are making an error of about $59,000 on the median house prices; whether we consider this large or small is something we can look into.
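A sketch of the scikit-learn fit, prediction, and error computation described above could be:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sketch (assumes the scaled features and y_train, y_test from the previous steps).
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)        # note: the target y_train is not scaled

y_pred = lr.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Test RMSE in dollars:", round(rmse))   # roughly the ~59K error mentioned above
```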
As I mentioned in the beginning, the idea behind using linear regression in this specific course is not to use it purely as traditional machine learning, but rather to perform causal analysis and see how we can interpret it. When it comes to the predictive power of the model, improving it can be considered a next step. You can check whether your model is overfitting, and then the next step could be to apply, for instance, LASSO regularization, so LASSO regression, which addresses overfitting. You can also consider going back and removing more outliers from the data; maybe the outliers we removed were not enough, so you can revisit that step as well. Another thing you can do is to consider somewhat more advanced machine learning algorithms, because even if the regression assumptions are satisfied, more flexible models like decision trees, random forests, or boosting techniques may be more appropriate and give you higher predictive power. You can also consider working more with the scaled or normalized version of your data. As the next step in your machine learning journey, consider learning more advanced machine learning models. Now that you know in detail what linear regression is and how to use it, how to train and test a machine learning model, a simple yet very popular one, and you also know what logistic regression and all these basics are, you are ready to move on to the next step, which is learning the other popular traditional machine learning models. Think about learning decision trees for modeling nonlinear relationships, and about learning bagging, boosting, random forests, and the different optimization algorithms like gradient descent, SGD, SGD with momentum, Adam, AdamW, and RMSprop: what the differences between them are and how you can implement them. Also consider learning clustering approaches like k-means, DBSCAN, and hierarchical clustering; doing this will help you get more hands-on and take the next step in machine learning. Once you have covered all these fundamentals, you are ready to go one step further, which is getting into deep learning. Thank you for watching this video. If you like this content, make sure to check out all the other videos available on this channel, and don't forget to subscribe, like, and comment to help the algorithm make this content more accessible to everyone across the world. If you want free resources, make sure to check the free resources section at LunarTech.ai. And if you want to become a job-ready data scientist and are looking for an accessible bootcamp that will help you get there, consider enrolling in the Ultimate Data Science Bootcamp at LunarTech.ai. You will learn all the theory and fundamentals to become a job-ready data scientist, you will implement the learned theory in multiple real-world data science projects, and, after learning the theory and practicing it with real-world case studies, you will also prepare for your data science interviews. And if you want to stay up to date with recent developments in tech, the headlines you may have missed in the last week, the open positions currently on the market across the globe, and the tech startups that are making waves, make sure to subscribe to the Data Science and AI newsletter from LunarTech.
