One-Hot Encoding
One-hot encoding: the standard approach for categorical features
A categorical feature takes one of a limited set of values, e.g. the color of a flower: yellow, red, or green.
One-hot encoding is a coding scheme that uses as many bits as there are states (category values), with exactly one bit set to 1 and all the others 0. For the three flower colors above, that means green → (1, 0, 0), red → (0, 1, 0), yellow → (0, 0, 1): three categories, three bits, exactly one bit set per sample.
Pandas offers a convenient function called get_dummies to get one-hot encodings. Call it like this:
one_hot_encoded_data = pd.get_dummies(data)
help(pd.get_dummies)
Help on function get_dummies in module pandas.core.reshape.reshape:
get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Convert categorical variable into dummy/indicator variables
Parameters
----------
data : array-like, Series, or DataFrame
prefix : string, list of strings, or dict of strings, default None
String to append DataFrame column names.
Pass a list with length equal to the number of columns
when calling get_dummies on a DataFrame. Alternatively, `prefix`
can be a dictionary mapping column names to prefixes.
prefix_sep : string, default '_'
If appending prefix, separator/delimiter to use. Or pass a
list or dictionary as with `prefix.`
dummy_na : bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
Column names in the DataFrame to be encoded.
If `columns` is None then all the columns with
`object` or `category` dtype will be converted.
sparse : bool, default False
Whether the dummy columns should be sparse or not. Returns
SparseDataFrame if `data` is a Series or if all columns are included.
Otherwise returns a DataFrame with some SparseBlocks.
drop_first : bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the
first level.
.. versionadded:: 0.18.0
dtype : dtype, default np.uint8
Data type for new columns. Only a single dtype is allowed.
.. versionadded:: 0.23.0
Returns
-------
dummies : DataFrame or SparseDataFrame
Examples
--------
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
a b
0 1 0
1 0 1
2 0 0
>>> pd.get_dummies(s1, dummy_na=True)
a b NaN
0 1 0 0
1 0 1 0
2 0 0 1
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
... 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1
>>> pd.get_dummies(pd.Series(list('abcaa')))
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
>>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
See Also
--------
Series.str.get_dummies
align: keeping the training and test encodings consistent
final_train_predictors, final_test_predictors= one_hot_encoded_training_data_predictors.align(one_hot_encoded_test_data_predictors, join='left',axis=1, fill_value=0)
#axis=1: align on columns
#join='left': keep exactly the columns from our training data
#fill_value=0: fill positions that have no value after alignment with 0 (the default fill is NaN)
#align
help(one_hot_encoded_X.align)
Help on method align in module pandas.core.frame:
align(self, other, join='outer', axis=None, level=None, copy=True, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None) method of pandas.core.frame.DataFrame instance
Align two objects on their axes with the
specified join method for each axis Index
Parameters
----------
other : DataFrame or Series
join : {'outer', 'inner', 'left', 'right'}, default 'outer'
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None)
level : int or level name, default None
Broadcast across a level, matching Index values on the
passed MultiIndex level
copy : boolean, default True
Always returns new objects. If copy=False and no reindexing is
required then original objects are returned.
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any
"compatible" value
method : str, default None
limit : int, default None
fill_axis : {0 or 'index', 1 or 'columns'}, default 0
Filling axis, method and limit
broadcast_axis : {0 or 'index', 1 or 'columns'}, default None
Broadcast values along this axis, if aligning two objects of
different dimensions
Returns
-------
(left, right) : (DataFrame, type of other)
Aligned objects
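To make the behavior concrete before the real example below, here is a tiny toy sketch (these DataFrames and column names are invented for illustration): with axis=1 and join='left', the right-hand frame is forced onto exactly the left frame's columns, and fill_value=0 fills in the columns it is missing.
import pandas as pd
#toy "training" frame with columns a, b, c and a toy "test" frame missing c but having an extra d
train_toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
test_toy = pd.DataFrame({'a': [7], 'b': [8], 'd': [9]})
left, right = train_toy.align(test_toy, join='left', axis=1, fill_value=0)
list(right.columns)    #['a', 'b', 'c'] : same columns as train_toy
right                  #column c is filled with 0, the extra column d is dropped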
Example: the watermelon dataset 3.0
# -*- coding: utf-8 -*-
import pandas as pd
watermelon_data= pd.read_csv(r'G:\kaggle\watermelon_3.csv')
watermelon_data
| | 编号 | 色泽 | 根蒂 | 敲声 | 纹理 | 脐部 | 触感 | 密度 | 含糖率 | 好瓜 |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 青绿 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.460 | 是 |
1 | 2 | 乌黑 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.774 | 0.376 | 是 |
2 | 3 | 乌黑 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.634 | 0.264 | 是 |
3 | 4 | 青绿 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.608 | 0.318 | 是 |
4 | 5 | 浅白 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.556 | 0.215 | 是 |
5 | 6 | 青绿 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 0.403 | 0.237 | 是 |
6 | 7 | 乌黑 | 稍蜷 | 浊响 | 稍糊 | 稍凹 | 软粘 | 0.481 | 0.149 | 是 |
7 | 8 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 硬滑 | 0.437 | 0.211 | 是 |
8 | 9 | 乌黑 | 稍蜷 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 0.666 | 0.091 | 否 |
9 | 10 | 青绿 | 硬挺 | 清脆 | 清晰 | 平坦 | 软粘 | 0.243 | 0.267 | 否 |
10 | 11 | 浅白 | 硬挺 | 清脆 | 模糊 | 平坦 | 硬滑 | 0.245 | 0.057 | 否 |
11 | 12 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 软粘 | 0.343 | 0.099 | 否 |
12 | 13 | 青绿 | 稍蜷 | 浊响 | 稍糊 | 凹陷 | 硬滑 | 0.639 | 0.161 | 否 |
13 | 14 | 浅白 | 稍蜷 | 沉闷 | 稍糊 | 凹陷 | 硬滑 | 0.657 | 0.198 | 否 |
14 | 15 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 0.360 | 0.370 | 否 |
15 | 16 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 硬滑 | 0.593 | 0.042 | 否 |
16 | 17 | 青绿 | 蜷缩 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 0.719 | 0.103 | 否 |
Problem encountered: the Chinese text read from the csv came out garbled.
Fix: open the csv file in Notepad, choose Save As, and in the encoding option that appears select UTF-8.
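Alternatively, the file can be left as-is and the encoding passed to read_csv directly; a minimal sketch, assuming the file is GBK-encoded (a common default for Chinese Windows/Excel exports; substitute whatever encoding the file actually uses):
watermelon_data = pd.read_csv(r'G:\kaggle\watermelon_3.csv', encoding='gbk')    #read a GBK-encoded csv without re-saving it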
watermelon_data.dtypes
编号 int64
色泽 object
根蒂 object
敲声 object
纹理 object
脐部 object
触感 object
密度 float64
含糖率 float64
好瓜 object
dtype: object
#how many distinct values does the 色泽 (color) column have
watermelon_data['色泽'].nunique()
3
#check whether the 色泽 column's dtype is non-numeric (object)
watermelon_data['色泽'].dtype=="object"
True
watermelon_data['含糖率'].dtype=="float"
True
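Applying the same dtype check to every column at once gives the list of categorical columns; a small sketch (the name categorical_cols is ours):
#collect all non-numeric (object-dtype) columns in one pass
categorical_cols = [col for col in watermelon_data.columns if watermelon_data[col].dtype == "object"]
#equivalently: watermelon_data.select_dtypes(include=['object']).columns
categorical_cols    #['色泽', '根蒂', '敲声', '纹理', '脐部', '触感', '好瓜']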
#target
y=watermelon_data['好瓜']
#X
X=watermelon_data.drop(['编号','好瓜'], axis=1)
X
#---------- alternatively ----------
#features=...
#X= watermelon_data[features]
| | 色泽 | 根蒂 | 敲声 | 纹理 | 脐部 | 触感 | 密度 | 含糖率 |
|---|---|---|---|---|---|---|---|---|
0 | 青绿 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.460 |
1 | 乌黑 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.774 | 0.376 |
2 | 乌黑 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.634 | 0.264 |
3 | 青绿 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.608 | 0.318 |
4 | 浅白 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.556 | 0.215 |
5 | 青绿 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 0.403 | 0.237 |
6 | 乌黑 | 稍蜷 | 浊响 | 稍糊 | 稍凹 | 软粘 | 0.481 | 0.149 |
7 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 硬滑 | 0.437 | 0.211 |
8 | 乌黑 | 稍蜷 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 0.666 | 0.091 |
9 | 青绿 | 硬挺 | 清脆 | 清晰 | 平坦 | 软粘 | 0.243 | 0.267 |
10 | 浅白 | 硬挺 | 清脆 | 模糊 | 平坦 | 硬滑 | 0.245 | 0.057 |
11 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 软粘 | 0.343 | 0.099 |
12 | 青绿 | 稍蜷 | 浊响 | 稍糊 | 凹陷 | 硬滑 | 0.639 | 0.161 |
13 | 浅白 | 稍蜷 | 沉闷 | 稍糊 | 凹陷 | 硬滑 | 0.657 | 0.198 |
14 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 0.360 | 0.370 |
15 | 浅白 | 蜷缩 | 浊响 | 模糊 | 平坦 | 硬滑 | 0.593 | 0.042 |
16 | 青绿 | 蜷缩 | 沉闷 | 稍糊 | 稍凹 | 硬滑 | 0.719 | 0.103 |
#handle the categorical features: one-hot encoding
one_hot_encoded_X= pd.get_dummies(X)
one_hot_encoded_X
| | 密度 | 含糖率 | 色泽_乌黑 | 色泽_浅白 | 色泽_青绿 | 根蒂_硬挺 | 根蒂_稍蜷 | 根蒂_蜷缩 | 敲声_沉闷 | 敲声_浊响 | 敲声_清脆 | 纹理_模糊 | 纹理_清晰 | 纹理_稍糊 | 脐部_凹陷 | 脐部_平坦 | 脐部_稍凹 | 触感_硬滑 | 触感_软粘 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.697 | 0.460 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0.774 | 0.376 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 0.634 | 0.264 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0.608 | 0.318 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | 0.556 | 0.215 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
5 | 0.403 | 0.237 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
6 | 0.481 | 0.149 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
7 | 0.437 | 0.211 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
8 | 0.666 | 0.091 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
9 | 0.243 | 0.267 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
10 | 0.245 | 0.057 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
11 | 0.343 | 0.099 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
12 | 0.639 | 0.161 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
13 | 0.657 | 0.198 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
14 | 0.360 | 0.370 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
15 | 0.593 | 0.042 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
16 | 0.719 | 0.103 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
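As the get_dummies help above notes, drop_first=True keeps only k-1 dummies per k-level category. A decision tree does not need this, but here is a sketch of the reduced encoding for reference:
one_hot_encoded_X_reduced = pd.get_dummies(X, drop_first=True)    #drops the first level of each categorical column
one_hot_encoded_X_reduced.shape    #(17, 13) instead of (17, 19): each of the 6 categorical columns loses one dummy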
from sklearn.tree import DecisionTreeClassifier
#model
model= DecisionTreeClassifier()
model
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
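A side note of ours: with the defaults shown above (random_state=None), ties between equally good splits may be broken differently on different runs, so the fitted tree is not guaranteed to be identical every time. Fixing the seed makes it reproducible, e.g.:
#hypothetical variant (not used below): fix random_state for a reproducible tree
reproducible_clf = DecisionTreeClassifier(random_state=0).fit(one_hot_encoded_X, y)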
#fit
clf= model.fit(one_hot_encoded_X, y)
clf
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
#predict a new sample: [青绿, 蜷缩, 沉闷, 清晰, 凹陷, 硬滑, 0.608, 0.300]
clf.predict([[0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,0,0.608,0.300]])
array(['\xe5\x90\xa6'], dtype=object)
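Strictly, predict expects the 19 values in exactly the order of one_hot_encoded_X.columns, and as the table above shows, 密度 and 含糖率 come first there, so hand-writing the vector is easy to get wrong. A safer sketch is to build the new sample as a one-row DataFrame, one-hot encode it, and reindex it to the training columns (the names new_sample and new_sample_encoded are ours, for illustration):
#build the new sample as a one-row DataFrame with the same raw columns as X
new_sample = pd.DataFrame([{'色泽': '青绿', '根蒂': '蜷缩', '敲声': '沉闷',
                            '纹理': '清晰', '脐部': '凹陷', '触感': '硬滑',
                            '密度': 0.608, '含糖率': 0.300}])
#one-hot encode it, then force the exact column order used for training;
#categories the sample does not contain are filled with 0
new_sample_encoded = pd.get_dummies(new_sample).reindex(columns=one_hot_encoded_X.columns, fill_value=0)
clf.predict(new_sample_encoded)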
How can we get the output to display the Chinese characters???
And what if there are multiple files (a test dataset, other data to predict on)? Scikit-learn is sensitive to column order, so if the training and test sets are not aligned, the results will be meaningless.
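On the display question: '\xe5\x90\xa6' is simply the UTF-8 byte string for 否. Under Python 2 the array repr escapes the bytes, but printing or decoding the element shows the character (under Python 3 the labels are unicode strings and display directly). A minimal sketch, reusing new_sample_encoded from above:
prediction = clf.predict(new_sample_encoded)
print(prediction[0].decode('utf-8'))    #Python 2: decodes the UTF-8 bytes and prints the label (是 or 否)
#under Python 3 the elements are already str and print as 是/否 without decoding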
This can happen if a categorical column takes a different set of values in the training data than in the test data.
How do we make sure the test data is encoded the same way as the training data?
Suppose:
test data: watermelon_3_test.csv (csv file encoded as UTF-8)
import pandas as pd
watermelon_test_data= pd.read_csv(r'G:\kaggle\watermelon_3_test.csv')
watermelon_test_data= watermelon_test_data.drop(['编号'], axis=1)
watermelon_test_data
#note: in the test file 纹理 has only 2 distinct values while the training data has 3, and several other categorical columns are also missing levels, so the one-hot encodings of the two files will not line up
| | 色泽 | 根蒂 | 敲声 | 纹理 | 脐部 | 触感 | 密度 | 含糖率 |
|---|---|---|---|---|---|---|---|---|
0 | 乌黑 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.697 | 0.460 |
1 | 乌黑 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.774 | 0.376 |
2 | 乌黑 | 蜷缩 | 浊响 | 清晰 | 凹陷 | 硬滑 | 0.611 | 0.264 |
3 | 青绿 | 蜷缩 | 沉闷 | 清晰 | 凹陷 | 硬滑 | 0.608 | 0.318 |
4 | 青绿 | 稍蜷 | 浊响 | 稍糊 | 稍凹 | 硬滑 | 0.639 | 0.172 |
5 | 浅白 | 稍蜷 | 沉闷 | 稍糊 | 凹陷 | 硬滑 | 0.657 | 0.198 |
6 | 乌黑 | 稍蜷 | 浊响 | 清晰 | 稍凹 | 软粘 | 0.360 | 0.370 |
one_hot_encoded_watermelon_test_data= pd.get_dummies(watermelon_test_data)
one_hot_encoded_watermelon_test_data
| | 密度 | 含糖率 | 色泽_乌黑 | 色泽_浅白 | 色泽_青绿 | 根蒂_稍蜷 | 根蒂_蜷缩 | 敲声_沉闷 | 敲声_浊响 | 纹理_清晰 | 纹理_稍糊 | 脐部_凹陷 | 脐部_稍凹 | 触感_硬滑 | 触感_软粘 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.697 | 0.460 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
1 | 0.774 | 0.376 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
2 | 0.611 | 0.264 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
3 | 0.608 | 0.318 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
4 | 0.639 | 0.172 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
5 | 0.657 | 0.198 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
6 | 0.360 | 0.370 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
How do we keep the encoding of the test file's one_hot_encoded_watermelon_test_data aligned with the training set's one_hot_encoded_X?
final_train, final_test= one_hot_encoded_X.align(one_hot_encoded_watermelon_test_data, join='left',axis=1, fill_value=0)
#axis=1: align on columns
#join='left': keep exactly the columns from our training data
#fill_value=0: fill positions that have no value after alignment with 0 (the default fill is NaN)
final_train
| | 密度 | 含糖率 | 色泽_乌黑 | 色泽_浅白 | 色泽_青绿 | 根蒂_硬挺 | 根蒂_稍蜷 | 根蒂_蜷缩 | 敲声_沉闷 | 敲声_浊响 | 敲声_清脆 | 纹理_模糊 | 纹理_清晰 | 纹理_稍糊 | 脐部_凹陷 | 脐部_平坦 | 脐部_稍凹 | 触感_硬滑 | 触感_软粘 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.697 | 0.460 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0.774 | 0.376 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 0.634 | 0.264 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0.608 | 0.318 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | 0.556 | 0.215 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
5 | 0.403 | 0.237 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
6 | 0.481 | 0.149 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
7 | 0.437 | 0.211 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
8 | 0.666 | 0.091 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
9 | 0.243 | 0.267 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
10 | 0.245 | 0.057 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
11 | 0.343 | 0.099 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
12 | 0.639 | 0.161 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
13 | 0.657 | 0.198 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
14 | 0.360 | 0.370 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
15 | 0.593 | 0.042 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
16 | 0.719 | 0.103 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
final_test
| | 密度 | 含糖率 | 色泽_乌黑 | 色泽_浅白 | 色泽_青绿 | 根蒂_硬挺 | 根蒂_稍蜷 | 根蒂_蜷缩 | 敲声_沉闷 | 敲声_浊响 | 敲声_清脆 | 纹理_模糊 | 纹理_清晰 | 纹理_稍糊 | 脐部_凹陷 | 脐部_平坦 | 脐部_稍凹 | 触感_硬滑 | 触感_软粘 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.697 | 0.460 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0.774 | 0.376 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
2 | 0.611 | 0.264 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
3 | 0.608 | 0.318 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | 0.639 | 0.172 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
5 | 0.657 | 0.198 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
6 | 0.360 | 0.370 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
clf.predict(final_test)
array(['\xe6\x98\xaf', '\xe6\x98\xaf', '\xe6\x98\xaf', '\xe6\x98\xaf',
'\xe5\x90\xa6', '\xe5\x90\xa6', '\xe5\x90\xa6'], dtype=object)
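These byte strings are again just the UTF-8 encodings of the labels; decoding them gives the readable predictions:
[label.decode('utf-8') for label in clf.predict(final_test)]    #Python 2
#['是', '是', '是', '是', '否', '否', '否'] : the first four test melons are predicted to be good, the last three not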