Machine Learning Data Models: 5 Machine Learning Models Every Data Scientist Should Know

Whether you’re aware of it or not, you almost certainly use artificial intelligence (AI) every day, because it powers many of today’s most widely used products. Google, Spotify, Siri, and Facebook all rely on machine learning (ML), a subset of AI.

ML allows those products to serve better, more personalized content, automate processes, and continuously optimize how they work, among other things. It does so through models: mathematical expressions, produced by training an algorithm on data, that stand in as a representation of that data in a particular context.

As such, models are essential for analyzing information and extracting insights from it. Maybe you’re trying to start a career as a data scientist, or you want to outsource the development of an ML-based application. Whatever your motivation, you’ve come to the right place to learn the basics of the most popular machine learning models.

Two Categories, a Multitude of Possibilities

Depending on the approach you want to take to process the data, you’ll use one of the two main categories of machine learning: supervised or unsupervised.

Supervised machine learning techniques use known inputs and outputs to identify patterns and understand how the results came to be. They rely on training runs over labeled data sets to learn the mechanism underlying the information. Thus, whenever you introduce a new input that wasn’t part of the training data set, the model can calculate a probable outcome from the pattern it identified.

ML developers and software outsourcing companies that focus on AI use supervised machine learning in fields as diverse as chemistry, manufacturing, and marketing.

Unsupervised machine learning, for its part, is a more exploratory approach to data analysis. Instead of learning data relationships from labeled training examples, it uses unlabeled data to detect patterns you didn’t previously know about. It does so by grouping similar inputs together based on their traits.

A lot of engineers use this approach in different industries. It’s especially useful for research, though you can use it to get insights from your sales department, your product pricing, or your logistics.

The possibilities both approaches offer are nearly endless, since you can apply them to almost any problem you can imagine. Of course, before you set out to do that, you have to know some of the basic models you have available.

5 Machine Learning Models Every Data Scientist Should Know

1. Classification

This method is a part of the supervised machine learning category. Its basic goal is to explain or predict a class value. In other words, this model defines the probability of something happening according to one or more inputs.

For example, you can use classification in an email client to filter spam. In this scenario, you have two possible outcomes: an email is spam or it isn’t. Depending on the inputs, the model predicts one or the other based on how it was trained. In fact, you’re doing just that whenever you flag a message as spam in your email account: you’re training the model to understand the basic traits of spam and enhance its protection.

In short, classification is a method that predicts a class label based on the training set and its values (which define the class labels in the first place). This technique encompasses several models, including logistic regression, decision trees, random forests, multilayer perceptrons, and gradient-boosted trees, among others.
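
The spam example above can be sketched with logistic regression, one of the models just listed. This is a minimal illustration that uses only the Python standard library, not a production filter: the two features (number of links and number of suspicious words per email), the tiny training set, and the function names are all assumptions made up for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            predicted = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = predicted - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict_spam_probability(weights, bias, x):
    """Return the estimated probability that an email is spam."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical training data: [number of links, number of suspicious words].
training_features = [[5, 4], [4, 6], [6, 5], [3, 5], [0, 0], [1, 0], [0, 1], [1, 1]]
training_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = not spam

weights, bias = train_logistic_regression(training_features, training_labels)

# An email the model has never seen: many links and many suspicious words.
print(predict_spam_probability(weights, bias, [6, 6]) > 0.5)
```

Flagging one more message as spam corresponds to appending a row to the training data and fitting again, which is how the feedback loop described above keeps improving the filter.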

2. Clustering

Clustering includes several methods that are part of the unsupervised machine learning category. As such, you can use it on unlabeled data sets to group values according to one or more specific traits or characteristics. The result? The algorithms form groups (called clusters) of similar values.

For that to happen, you need to define a similarity measure, which is simply a metric that looks at one or more features. Once you have this measure, you can apply it to the data set to form clusters. For instance, you could have a lot of music albums that you could group by genre, by decade, or by country. Each of these similarity measures would produce different clusters and insights, so it’ll be up to you to decide which one works best.
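
To make the album example concrete, here is a small sketch in which the similarity measure is the simplest one possible: two albums are "similar" when they share the same value for a chosen trait. The album records and the helper name `cluster_by` are invented for illustration. Real clustering algorithms work with distances rather than exact matches, but the idea that the measure determines the clusters is the same.

```python
from collections import defaultdict

# Hypothetical album records, made up for this example.
albums = [
    {"title": "Album A", "genre": "rock", "year": 1975, "country": "UK"},
    {"title": "Album B", "genre": "rock", "year": 1991, "country": "US"},
    {"title": "Album C", "genre": "jazz", "year": 1979, "country": "US"},
    {"title": "Album D", "genre": "jazz", "year": 1995, "country": "UK"},
]

def cluster_by(records, trait):
    """Group album titles by the value returned by the `trait` function."""
    clusters = defaultdict(list)
    for record in records:
        clusters[trait(record)].append(record["title"])
    return dict(clusters)

# The same data set yields different clusters under different measures.
by_genre = cluster_by(albums, lambda album: album["genre"])
by_decade = cluster_by(albums, lambda album: album["year"] // 10 * 10)
by_country = cluster_by(albums, lambda album: album["country"])

print(by_genre)
print(by_decade)
print(by_country)
```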

Clustering algorithms such as density-based spatial clustering of applications with noise (DBSCAN), agglomerative hierarchical clustering, and mean-shift clustering, among others, are some of the options you can choose from. Several sectors and activities use them for things such as market segmentation, social network analysis, and medical imaging.
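
As a rough sketch of how one of these algorithms works, the following is a compact, unoptimized take on DBSCAN written with the standard library only (`math.dist` needs Python 3.8 or later). The sample points and the `eps` and `min_samples` values are assumptions chosen so the result is easy to follow. For real work you would normally reach for a tested library implementation.

```python
import math

NOISE = -1

def region_query(points, index, eps):
    """Return the indices of all points within `eps` of points[index], itself included."""
    return [j for j, other in enumerate(points) if math.dist(points[index], other) <= eps]

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels

# Two dense groups and one isolated point (hypothetical data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.1, 7.9), (7.9, 8.2),
          (4.0, 15.0)]

labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike algorithms that need the number of clusters up front, this one discovers the two groups on its own and marks the isolated point as noise.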

3. Regression

Regression is another method that’s part of supervised machine learning. With it, you use previous data to predict or explain a real or continuous value (such as prices or salaries). Its simplest form is linear regression, which usually gives a rougher approximation than more complex forms such as polynomial regression or neural networks.

Regression techniques begin with a hypothesis, which is a function of the input values and some unknown parameters. When training an algorithm to tackle regression, you use a data set that lets the algorithm refine its estimates of those unknown parameters. Once the estimates are good enough, you can apply the resulting hypothesis to real data.
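
Here is a minimal sketch of that idea, assuming the simplest possible hypothesis, a straight line with two unknown parameters (a slope and an intercept), and plain gradient descent as the refinement procedure. The data points and the hyperparameters are made up for the example.

```python
def hypothesis(x, slope, intercept):
    """The assumed model: a straight line with two unknown parameters."""
    return slope * x + intercept

def fit_by_gradient_descent(xs, ys, learning_rate=0.05, epochs=5000):
    """Refine slope and intercept step by step to reduce the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [hypothesis(x, slope, intercept) - y for x, y in zip(xs, ys)]
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_intercept = 2 * sum(errors) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Training data generated from y = 2x + 1, so the "hidden" parameters are 2 and 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_by_gradient_descent(xs, ys)

# Apply the trained hypothesis to an input that was not in the training data.
prediction = hypothesis(10.0, slope, intercept)
print(round(slope, 3), round(intercept, 3), round(prediction, 3))
```

The training loop never sees the values 2 and 1 directly. It only sees the examples, and each pass nudges the parameters in the direction that shrinks the error.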

An approach like this one is useful for things like estimating the value of a house. By combining different input values (such as square footage, age of the building, energy consumption, etc.), you could predict how much it could cost in the future, after renovations, or with any variation on those inputs whatsoever.
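
The house example can be sketched as a linear regression with several inputs, solved here with ordinary least squares. Everything in the snippet is an assumption for illustration: the two features (floor area and building age), the prices in thousands, and the helper names. The prices were generated from a simple formula so that the fit is exact. Real housing data would be noisy and would need many more inputs.

```python
def solve_linear_system(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [value] for row, value in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit_least_squares(rows, targets):
    """Return [intercept, coef_1, coef_2, ...] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefficients, row):
    """Evaluate the fitted linear model on one house."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))

# Hypothetical houses: (floor area in square meters, age of the building in years).
houses = [(50, 30), (80, 10), (120, 5), (100, 40), (65, 20)]
prices = [120.0, 200.0, 285.0, 210.0, 160.0]  # in thousands, made up for the example

coefficients = fit_least_squares(houses, prices)

# "What if" question: the 80 m2 house right after a full renovation (age reset to 0).
renovated_price = predict(coefficients, (80, 0))
print(round(renovated_price, 2))
```

Changing any input and calling `predict` again is the code equivalent of asking how the price would move under a different scenario.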

4. Dimensionality Reduction

Unlike classification and regression, dimensionality reduction is usually treated as an unsupervised technique. You use it to cut down the noise in data sets that have grown so large that you might end up processing a lot of useless or redundant data. It gets rid of some of that unwanted information by merging similar or correlated features into a smaller number of combined ones, which reduces the amount of detail.

Think of it like this. Imagine that the vast majority of the customers in your market segment are 30-year-old women. You could reduce the size of your data set by ignoring the information that comes from men and from people who are older or younger than 30. You’d be losing some data, sure, but the loss would be acceptable given the insights you get in return.

There are several methods you could use to apply dimensionality reduction, including popular ones like principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). These approaches can be linear or non-linear, and each applies a different logic to the reduction. So, choose the one that best fits your data and your needs.
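
As an illustration of the linear case, the sketch below reduces two correlated features to a single one with a bare-bones principal component analysis. It assumes two-dimensional data and finds the main axis with power iteration. The sample points and function names are invented for the example, and a real project would typically rely on a library implementation instead.

```python
import math

def principal_axis(points, iterations=100):
    """Return the mean of 2D points and the unit direction of greatest variance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        nx = cxx * vx + cxy * vy
        ny = cxy * vx + cyy * vy
        norm = math.hypot(nx, ny)
        vx, vy = nx / norm, ny / norm
    return (mean_x, mean_y), (vx, vy)

def project(points, mean, axis):
    """Replace each 2D point with its single coordinate along the main axis."""
    return [(x - mean[0]) * axis[0] + (y - mean[1]) * axis[1] for x, y in points]

# Two strongly correlated features (hypothetical measurements).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

mean, axis = principal_axis(points)
reduced = project(points, mean, axis)

print(axis)     # roughly the diagonal direction, since the features move together
print(reduced)  # one number per point instead of two
```

Because the two features rise and fall together, one combined coordinate keeps almost all of the variation, which is the acceptable loss of detail that dimensionality reduction trades on.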

5. Ensemble Methods

This approach combines several supervised machine learning predictive models into one to refine the resulting predictions. It’s the whole “the strength of the wolf is the pack” kind of perspective that lies beneath the ensemble method approach.

Using different models can lead you to better results, as they combine their strengths to offset the weaknesses you’d find if you used them separately. Besides, depending on the technique, the combination can reduce the bias or the variance of the learning model, which leads to fewer inaccuracies.

You should know that ensemble methods typically require more computation than a single model, so some people see them as a way to compensate for poor learning algorithms by way of computational processing. However, they excel in specific tasks such as face recognition, malware detection, and land mapping.
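
A toy sketch of the idea, assuming three deliberately weak spam classifiers combined by majority vote. The hand-written rules stand in for trained models, and the emails and thresholds are invented for the example. Practical ensembles such as random forests or gradient-boosted trees train their component models instead of hard-coding them.

```python
from collections import Counter

# Three weak "models": each looks at a single trait of an email.
def by_links(email):
    return "spam" if email["links"] >= 3 else "ham"

def by_words(email):
    return "spam" if email["spam_words"] >= 2 else "ham"

def by_sender(email):
    return "ham" if email["known_sender"] else "spam"

def majority_vote(models, email):
    """Return the label predicted by most of the models."""
    votes = [model(email) for model in models]
    return Counter(votes).most_common(1)[0][0]

def accuracy(predict, labeled_emails):
    """Fraction of emails for which `predict` returns the true label."""
    correct = sum(1 for email, label in labeled_emails if predict(email) == label)
    return correct / len(labeled_emails)

# Hypothetical labeled emails; each weak model gets at least one of them wrong.
labeled_emails = [
    ({"links": 5, "spam_words": 4, "known_sender": False}, "spam"),
    ({"links": 4, "spam_words": 0, "known_sender": True}, "ham"),
    ({"links": 0, "spam_words": 3, "known_sender": False}, "spam"),
    ({"links": 1, "spam_words": 2, "known_sender": True}, "ham"),
    ({"links": 0, "spam_words": 0, "known_sender": False}, "ham"),
    ({"links": 3, "spam_words": 1, "known_sender": False}, "spam"),
]

models = [by_links, by_words, by_sender]
individual_scores = [accuracy(model, labeled_emails) for model in models]
ensemble_score = accuracy(lambda email: majority_vote(models, email), labeled_emails)

print(individual_scores)
print(ensemble_score)
```

On this small, hand-picked set the vote beats every individual rule because the rules make their mistakes on different emails. The extra cost mentioned above is also visible: every prediction now runs three models instead of one.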

Closing Comments

Don’t believe for a second that machine learning models end with these five. There are other powerful models, including the deep learning algorithms that are all the rage now. However, learning the basics might open the door for you to understand the complex world of artificial intelligence in general.

Needless to say, you’ll need a lot of knowledge to create one of these algorithms on your own, let alone build an accurate one. So, if you need one for your business, you have two paths: either outsource the development of your model or sit down and start learning.

Translated from: https://www.thecrazyprogrammer.com/2020/01/machine-learning-models.html