Data Mining Competitions - A Brief Summary of Experience
Published: 2019-06-06


I. EDA of Individual Features

  1. For binary and categorical features, use train['feature_name'].value_counts().sort_index().plot(kind='bar')

  2. For continuous numerical features, plot the empirical CDF:

    import numpy as np
    import matplotlib.pyplot as plt

    def cdf_plot(data_series):
        data_size = len(data_series)
        data_set = sorted(set(data_series))
        bins = np.append(data_set, data_set[-1] + 1)
        counts, bin_edges = np.histogram(data_series, bins=bins, density=False)
        counts = counts.astype(float) / data_size
        cdf = np.cumsum(counts)
        plt.plot(bin_edges[0:-1], cdf, linestyle='--', marker="o", color='b')
        plt.ylim((0, 1))
        plt.ylabel("CDF")
        plt.grid(True)
        plt.show()

II. Handling Categorical Features

There are three main approaches:

  1. If the values of the categorical feature have an ordinal relationship (ordinality), i.e. they can be meaningfully ordered, the feature can be mapped directly to a numerical feature.

  2. One-hot encoding

    The most common approach.
    When using one-hot encoding, it may be worth encoding only the values that occur frequently: set a frequency threshold, and values whose count falls below it are either dropped or encoded as a single special value (all rare values share one code); see the sketch after this list.
    (The hashing trick can be used to reduce memory usage.)

  3. Statistical encoding

    The most common statistical encoding is the count feature: count how many times each category appears in the training set (optionally together with the test set); this is also shown in the sketch after this list.
    Statistics computed from the labels, such as Target Encoding or Leave-One-Out Encoding, can cause information leakage.
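
A minimal sketch of the frequency-threshold one-hot encoding and count encoding described above, assuming hypothetical DataFrames train and test and an illustrative column name; the threshold and the rare token are arbitrary placeholders:

import pandas as pd

def encode_categorical(train, test, col, min_count=10, rare_token='__rare__'):
    # Count encoding: how often each value appears in the training set.
    counts = train[col].value_counts()
    train[col + '_count'] = train[col].map(counts).fillna(0)
    test[col + '_count'] = test[col].map(counts).fillna(0)

    # Collapse values that occur fewer than min_count times into one special token.
    rare_values = counts[counts < min_count].index
    train_col = train[col].where(~train[col].isin(rare_values), rare_token)
    test_col = test[col].where(~test[col].isin(rare_values), rare_token)

    # One-hot encode; align columns so train and test share the same dummy set.
    train_ohe = pd.get_dummies(train_col, prefix=col)
    test_ohe = pd.get_dummies(test_col, prefix=col).reindex(columns=train_ohe.columns, fill_value=0)
    return train_ohe, test_ohe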

One-hot encoding can produce high-dimensional sparse features. Different model families exploit them differently:

  • LR and linear SVM learn how much each individual feature affects the outcome, i.e. its linear relationship with the target.
  • FM and FFM learn the effect of second-order feature interactions.
  • Tree models such as GBDT can learn higher-order interactions between features.
  • DeepFM, or GBDT leaf nodes fed into LR (or FFM), combine the effects of low-order and high-order features.

On Target Encoding:

import numpy as np
import pandas as pd

def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,    # Revised to encode validation series
                  val_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index
    ft_val_series = pd.merge(
        val_series.to_frame(val_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=val_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_val_series.index = val_series.index
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_val_series, noise_level), add_noise(ft_tst_series, noise_level)
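
A minimal usage sketch for the function above, assuming hypothetical DataFrames train, valid and test with a categorical column 'cat_col' and a binary 'target' column:

trn_enc, val_enc, tst_enc = target_encode(
    trn_series=train['cat_col'],
    val_series=valid['cat_col'],
    tst_series=test['cat_col'],
    target=train['target'],
    min_samples_leaf=100,
    smoothing=10,
    noise_level=0.01)

# Attach the encoded feature to each split.
train['cat_col_te'] = trn_enc
valid['cat_col_te'] = val_enc
test['cat_col_te'] = tst_enc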

On Bayesian smoothing:

import random
import numpy as np
import pandas as pd
import scipy.special as special

class HyperParam(object):
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta

    def sample_from_beta(self, alpha, beta, num, imp_upperbound):
        # Generate sample data
        sample = np.random.beta(alpha, beta, num)
        I = []
        C = []
        for click_ratio in sample:
            imp = random.random() * imp_upperbound
            # imp = imp_upperbound
            click = imp * click_ratio
            I.append(imp)
            C.append(click)
        return pd.Series(I), pd.Series(C)

    def update_from_data_by_FPI(self, tries, success, iter_num, epsilon):
        # Update strategy
        for i in range(iter_num):
            new_alpha, new_beta = self.__fixed_point_iteration(tries, success, self.alpha, self.beta)
            if abs(new_alpha - self.alpha) < epsilon and abs(new_beta - self.beta) < epsilon:
                break
            self.alpha = new_alpha
            self.beta = new_beta

    def __fixed_point_iteration(self, tries, success, alpha, beta):
        # One fixed-point iteration step
        sumfenzialpha = 0.0
        sumfenzibeta = 0.0
        sumfenmu = 0.0
        sumfenzialpha = (special.digamma(success + alpha) - special.digamma(alpha)).sum()
        sumfenzibeta = (special.digamma(tries - success + beta) - special.digamma(beta)).sum()
        sumfenmu = (special.digamma(tries + alpha + beta) - special.digamma(alpha + beta)).sum()
        return alpha * (sumfenzialpha / sumfenmu), beta * (sumfenzibeta / sumfenmu)
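
A minimal usage sketch, assuming hypothetical pd.Series of per-category impressions and clicks: fit alpha and beta by fixed-point iteration, then use them as pseudo-counts to smooth the raw click-through rate.

hp = HyperParam(1.0, 1.0)  # initial alpha and beta
hp.update_from_data_by_FPI(impressions, clicks, iter_num=1000, epsilon=1e-8)

# Smoothed rate: (clicks + alpha) / (impressions + alpha + beta)
smoothed_ctr = (clicks + hp.alpha) / (impressions + hp.alpha + hp.beta)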

III. Feature Engineering and Feature Selection

Train a GBDT or RF model and rank the training-set features by importance, from high to low.

  1. Build cross features directly

    Multiply or divide pairs of high-importance features.

  2. Split the features into two groups, use one group as the training inputs to predict, in turn, the value of each feature in the other group, and add the predictions as new features

  3. Feature Aggregation

    Compute cross statistics over high-importance features.
    Concretely, each time pick two high-importance features, use one as the group-by key, and compute the min/max, mean, median, variance, etc. of the other within each group, e.g. new_feature = features.groupby('feature1')['feature2'].mean() (see the sketch below).
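
A minimal sketch of both ideas, assuming a hypothetical DataFrame features whose two most important columns are 'feature1' and 'feature2':

import pandas as pd

def add_cross_and_agg_features(features):
    df = features.copy()

    # Cross features: multiply / divide two high-importance features.
    df['feature1_x_feature2'] = df['feature1'] * df['feature2']
    df['feature1_div_feature2'] = df['feature1'] / (df['feature2'] + 1e-8)

    # Feature aggregation: group by one feature, aggregate the other,
    # then map the group statistics back onto every row.
    agg = df.groupby('feature1')['feature2'].agg(['mean', 'median', 'max', 'min', 'std'])
    agg.columns = ['feature2_%s_by_feature1' % c for c in agg.columns]
    df = df.merge(agg, left_on='feature1', right_index=True, how='left')
    return df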

Common feature selection methods:

  1. Exhaustive search

    The advantage is that it is guaranteed to find the optimal subset of the full feature set; the drawback is the O(2^n) time complexity.

  2. Randomized search

    A heuristic: repeatedly sample a subset of the features, train on it, and iterate. Computationally much cheaper.

  3. mRMR feature selection (minimum Redundancy, Maximum Relevance); a greedy sketch follows
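
A minimal greedy mRMR sketch, assuming a feature DataFrame X and a classification target y; here mutual information is used for both the relevance and the redundancy terms, which is one common instantiation rather than the only one:

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, n_features=10):
    # Relevance of each feature to the target.
    relevance = pd.Series(mutual_info_classif(X, y), index=X.columns)
    selected, remaining = [], list(X.columns)

    while remaining and len(selected) < n_features:
        best_score, best_feat = -np.inf, None
        for f in remaining:
            # Redundancy: average MI between the candidate and already selected features.
            if selected:
                redundancy = np.mean([mutual_info_regression(X[[f]], X[s])[0] for s in selected])
            else:
                redundancy = 0.0
            score = relevance[f] - redundancy
            if score > best_score:
                best_score, best_feat = score, f
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected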

IV. XGBoost Parameter Tuning

  1. Initialize the parameters:

    eta = 0.1, max_depth = 10, subsample = 1.0, min_child_weight = 5,
    colsample_bytree = 0.5 (the fraction of features randomly sampled when constructing each tree; a sensible value depends on how many features the data has).
    Apart from eta = 0.1, the initial values of the other parameters depend on the specific problem.
    Choose a suitable objective and eval_metric. In xgboost.train(), the obj and feval arguments take a custom loss (objective) function and a custom evaluation function respectively; the maximize argument indicates whether the evaluation metric should be maximized.

  2. Hold out 20% of the data as a validation set, set a large num_rounds, and stop training when the validation error starts to rise.

    1. Tune max_depth
    2. Tune subsample - the subsample ratio of the training instances
    3. Tune min_child_weight
    4. Tune colsample_bytree - the subsample ratio of columns when constructing each tree
    5. Finally, lower eta to 0.02 and find the best num_rounds
  3. Use the parameters obtained from the steps above as a baseline, then make small adjustments on top of it so that the model gets as close as possible to a local optimum. A sketch of steps 1-2 follows.
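
A minimal sketch of the initialization and early stopping described above, assuming hypothetical arrays X_train, y_train, X_val, y_val and a binary classification task:

import xgboost as xgb

params = {
    'eta': 0.1,
    'max_depth': 10,
    'subsample': 1.0,
    'min_child_weight': 5,
    'colsample_bytree': 0.5,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Large num_rounds; training stops when the validation metric stops improving.
model = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=100,
    verbose_eval=50)

print(model.best_iteration, model.best_score)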

V. Ensembling

  1. Voting - take the result that receives the most votes

    1. Uniform

      A lower correlation between ensemble model members seems to result in an increase in the error-correcting capability.

    2. Weighted

      Give a better model more weight: the only way for the inferior models to overrule the best model (the expert) is for them to collectively (and confidently) agree on an alternative.

  2. Averaging - bagging-like, reduces overfitting

    1. averaging

      average the submissions from multiple models

    2. rank averaging

      First turn the predictions into ranks, then average these ranks.

  3. stacking and blending

    1. Stacked generalization

      The basic idea behind stacked generalization is to use a pool of base classifiers and then another classifier to combine their predictions, with the aim of reducing the generalization error.
      The ensemble methods above all combine the predictions of different models through a fixed formula or rule; stacking instead learns the combination with another algorithm (a classifier).

    2. Blending

      Instead of cross-validation, only a holdout set is used: the stacker model is trained on the base models' predictions on the holdout data.
      Pros and cons compared with stacking:
      Pros: simple and fast; no information leakage (e.g. with a 10% holdout, the first layer is trained on 90% of the data and the second layer on the remaining 10%).
      Cons: less training data is used, it is easy to overfit the holdout set, and the CV estimate is less reliable.

    3. Stacking with logistic regression (a minimal out-of-fold sketch is given at the end of this section)

    4. Stacking with non-linear algorithms

      Popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET.
      Non-linear stacking with the original features on multiclass problems gives surprising gains.

    5. Feature weighted linear stacking

      First, each model makes predictions from the engineered features; then a linear model learns which model is best for which kinds of samples, and the final prediction is formed as a weighted sum of the individual models' predictions.
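
A minimal out-of-fold sketch of stacking with logistic regression, assuming hypothetical numpy arrays X, y, X_test and a list of sklearn-style base classifiers; it illustrates the idea rather than any particular competition pipeline:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def stack_with_lr(base_models, X, y, X_test, n_splits=5):
    # Level-1 features are the out-of-fold predictions of each base model.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof_preds = np.zeros((len(X), len(base_models)))
    test_preds = np.zeros((len(X_test), len(base_models)))

    for m, model in enumerate(base_models):
        fold_test = np.zeros((len(X_test), n_splits))
        for k, (trn_idx, val_idx) in enumerate(kf.split(X)):
            model.fit(X[trn_idx], y[trn_idx])
            oof_preds[val_idx, m] = model.predict_proba(X[val_idx])[:, 1]
            fold_test[:, k] = model.predict_proba(X_test)[:, 1]
        test_preds[:, m] = fold_test.mean(axis=1)  # average the fold models on the test set

    # The stacker is trained on out-of-fold predictions to avoid leakage.
    stacker = LogisticRegression()
    stacker.fit(oof_preds, y)
    return stacker.predict_proba(test_preds)[:, 1]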

Reposted from: https://www.cnblogs.com/viredery/p/competition_how_to_deal_with_categorical_features.html
