Kaggle Learning: Learn Machine Learning 2. Starting Your ML Project

2. Starting Your ML Project

This article is part of the Kaggle self-study series. Click here to return to the table of contents.



This tutorial is part of the Learn Machine Learning educational track. Click here for the original article.

 

Starting Your Project

 

You are about to build a simple model and then continually improve it. It is easiest to keep one browser tab for the tutorials you are reading, and a separate browser window with the code you are writing. You will write all your code in the same place, even as you progress through the sequence of explanations and instructions spread over multiple pages. (In other words, keep two tabs open: one for coding and one for reading.)

 

You will open a workspace to write your code using THIS LINK (Ctrl-click to open it). Open that link in a new tab. You will read examples predicting home prices using data from Melbourne, Australia. You will then write code to build a model predicting prices in the US state of Iowa. The Iowa data is pre-loaded in your coding notebook.

 

Working in Kaggle Notebooks

You will be coding in a "notebook" environment. Notebooks let you easily see your code and its output in one place. A couple of tips on the Kaggle notebook environment:

 

1)   It is composed of "cells." You will write code in the cells. Add a new cell by clicking on a cell and then using the arrow buttons. The arrows indicate whether the new cell goes above or below your current location.


2)   Execute the code in the current cell with the keyboard shortcut Control-Enter.

 

Using Pandas to Get Familiar With Your Data

The first thing you'll want to do is familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool that modern data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd. We do this with the following command (the code here is Python 3):

 

        import pandas as pd

 

The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database. The Pandas DataFrame has powerful methods for most things you'll want to do with this type of data. Let's start by looking at a basic data overview with our example data from Melbourne and the data you'll be working with from Iowa.
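
To make the table analogy concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The column names and values are made up for illustration and do not come from the Melbourne or Iowa files.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [500000, 750000, 1200000],
})

print(homes)           # rows and named columns, like a spreadsheet
print(homes.shape)     # (number of rows, number of columns)
print(homes["Price"])  # select a single column by its name
```

Each dictionary key becomes a column and each list position becomes a row, which is why the result reads like a small spreadsheet.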

The example will use data at the file path ../input/melbourne-housing-snapshot/melb_data.csv.

Your data will be available in your notebook at ../input/train.csv (which is already typed into the sample code for you; in the notebook itself the path actually appears as ../input/house-prices-advanced-regression-techniques/train.csv). Remember that you will be using the Iowa data instead of the Melbourne data when writing your own code.

In the example data from Melbourne, we load and explore the data with the following:

import pandas as pd

# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)
# print a summary of the data in Melbourne data
print(melbourne_data.describe())

(Note: Python is sensitive to indentation, so make sure each of these lines starts at the left margin when you copy them.)
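
If you want to try the same load-and-describe steps without access to the Kaggle files, the sketch below reads a few lines of made-up CSV text from memory instead of from disk. The names csv_text and toy_data and all the values are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file on disk.
csv_text = """Rooms,Price
2,500000
3,750000
4,1200000
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
toy_data = pd.read_csv(io.StringIO(csv_text))

# describe() returns a DataFrame with one summary row per statistic.
summary = toy_data.describe()
print(summary)
print(list(summary.index))
```

The row labels printed at the end are the eight statistics discussed in the next section.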

 

Interpreting Data Description

The results show 8 numbers for each column in your original dataset. The first number, the count, shows how many rows have non-missing values.

 

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.
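
The sketch below shows how the count differs from the number of rows once a value is missing. The column names and sizes are hypothetical, chosen to mirror the one-bedroom example.

```python
import pandas as pd

# Made-up values; None marks a measurement that was never collected.
rooms = pd.DataFrame({
    "Bedroom1Size": [12.0, 15.0, 11.0],
    "Bedroom2Size": [9.0, None, 10.0],  # the second home has one bedroom
})

print(len(rooms))          # total number of rows
print(rooms.count())       # non-missing values per column
print(rooms.isna().sum())  # missing values per column
```

Here the table has three rows, but Bedroom2Size has a count of two, which is the same number describe() would report for that column.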

 

The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.
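
As a small worked example with made-up numbers, the sketch below computes the mean and the standard deviation directly. Note that pandas reports the sample standard deviation (dividing by n - 1) by default, which is also what describe() shows.

```python
import pandas as pd

prices = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up values

mean = prices.mean()
sample_std = prices.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = prices.std(ddof=0)  # divide by n instead

print(mean)
print(sample_std)
print(population_std)
print(prices.describe()["std"])      # matches sample_std
```

The sample version is slightly larger than the population version, and the gap shrinks as the number of rows grows.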

 

To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number. (Translator's note: this explanation felt slightly off to me; I may revisit it later.)
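
The sorting picture can be checked on a handful of made-up numbers. In this sketch, quantile() is the pandas method for percentiles; by default it interpolates linearly when a percentile falls between two values.

```python
import pandas as pd

values = pd.Series([40, 10, 50, 20, 30])  # made-up and unsorted

print(values.sort_values().tolist())  # [10, 20, 30, 40, 50]
print(values.min())                   # smallest value
print(values.quantile(0.25))          # 20.0, a quarter of the way through
print(values.quantile(0.50))          # 30.0, the median
print(values.quantile(0.75))          # 40.0
print(values.max())                   # largest value
```

With five sorted values, the 25th percentile lands exactly on the second one, the median on the third, and the 75th percentile on the fourth.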

 

Your Turn

1. If you didn't open a window for writing your own code using the link at the top of the page, open this link in a new tab to access your coding workspace.

 

2. Read the Iowa data and print the summary information. The file path for your data is already shown in your coding notebook. Look at the mean, minimum and maximum values for the first few fields. Are any of the values so crazy that it makes you think you've misinterpreted the data?

 

There are a lot of fields in this data. You don't need to look at it all quite yet.

 

When your code is correct, you'll see the size, in square feet, of the smallest lot in your dataset. This is from the min value of LotArea, and you can see the max size too. You should notice that it's a big range of lot sizes!
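
One way to pull those two numbers out of the summary is sketched below. The lot sizes are made up for illustration; only the column name LotArea comes from the text, and the real values come from the Iowa file in your notebook.

```python
import pandas as pd

# Made-up lot sizes in square feet, standing in for the Iowa data.
iowa_like = pd.DataFrame({"LotArea": [5000, 7200, 9800, 12000, 40000]})

summary = iowa_like.describe()
smallest = summary.loc["min", "LotArea"]  # row label, then column name
largest = summary.loc["max", "LotArea"]
print(smallest, largest)

# The same numbers, read straight from the column.
print(iowa_like["LotArea"].min(), iowa_like["LotArea"].max())
```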

 

You'll also see some columns filled with .... That indicates that we had too many columns of data to print, so the middle ones were omitted from printing.

 

We'll take care of both issues in the next step.
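
If you would like to see every column right away, pandas offers a few options. The sketch below uses a made-up wide table (30 columns named col0 to col29) and is not necessarily the approach the next step takes.

```python
import pandas as pd

# Made-up wide table: 30 numeric columns named col0 .. col29.
wide = pd.DataFrame({f"col{i}": [i, i + 1, i + 2] for i in range(30)})
summary = wide.describe()

# Option 1: render the full summary as text, with no columns left out.
print(summary.to_string())

# Option 2: transpose, so each original column becomes one row.
print(summary.T)

# Option 3: lift the column limit temporarily for ordinary printing.
with pd.option_context("display.max_columns", None):
    print(summary)
```

Transposing is often the easiest to read when there are many columns, because the eight statistics become the column headers.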

 

Continue

Move on to the next page, where you will focus on the most relevant columns.

 

Successful Result

[Screenshot: notebook output of a successful run]

