Data Mining Competitions - A Brief Summary of Experience
Published: 2019-06-06


I. EDA of Individual Features

  1. For binary and categorical features, use train['feature_name'].value_counts().sort_index().plot(kind='bar')

  2. For continuous numerical features, plot the empirical CDF:

    import numpy as np
    import matplotlib.pyplot as plt

    def cdf_plot(data_series):
        data_size = len(data_series)
        data_set = sorted(set(data_series))
        bins = np.append(data_set, data_set[-1] + 1)
        counts, bin_edges = np.histogram(data_series, bins=bins, density=False)
        counts = counts.astype(float) / data_size
        cdf = np.cumsum(counts)
        plt.plot(bin_edges[0:-1], cdf, linestyle='--', marker="o", color='b')
        plt.ylim((0, 1))
        plt.ylabel("CDF")
        plt.grid(True)
        plt.show()

II. Handling Categorical Features

There are three main approaches:

  1. If the values of the categorical feature have an ordinal relationship (ordinality), i.e. they can be meaningfully ordered, the feature can be mapped directly to a numerical feature.

  2. One-hot encoding

    The most common approach.
    When using one-hot encoding, it may be worth encoding only the values that occur frequently: set a frequency threshold, and values whose count falls below it are either dropped or encoded as a single special value (all rare values share one code); see the sketch after this list.
    (The hashing trick can be used to reduce memory usage.)

  3. Statistical encoding

    The most common statistical encoding is the count feature: count how many times each category appears in the training set (optionally together with the test set); this is also shown in the sketch after this list.
    Statistics computed from the labels, such as Target Encoding or Leave-One-Out Encoding, can cause information leakage.
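
A minimal sketch of the frequency-threshold one-hot encoding and count encoding described above, assuming hypothetical DataFrames train and test and an illustrative column name; the threshold and the rare token are arbitrary placeholders:

import pandas as pd

def encode_categorical(train, test, col, min_count=10, rare_token='__rare__'):
    # Count encoding: how often each value appears in the training set.
    counts = train[col].value_counts()
    train[col + '_count'] = train[col].map(counts).fillna(0)
    test[col + '_count'] = test[col].map(counts).fillna(0)

    # Collapse values that occur fewer than min_count times into one special token.
    rare_values = counts[counts < min_count].index
    train_col = train[col].where(~train[col].isin(rare_values), rare_token)
    test_col = test[col].where(~test[col].isin(rare_values), rare_token)

    # One-hot encode; align columns so train and test share the same dummy set.
    train_ohe = pd.get_dummies(train_col, prefix=col)
    test_ohe = pd.get_dummies(test_col, prefix=col).reindex(columns=train_ohe.columns, fill_value=0)
    return train_ohe, test_ohe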

One-hot encoding can produce high-dimensional sparse features. Different model families exploit them differently:

  • LR and linear SVM learn how much each individual feature affects the outcome, i.e. its linear relationship with the target.
  • FM and FFM learn the effect of second-order feature interactions.
  • Tree models such as GBDT can learn higher-order interactions between features.
  • DeepFM, or GBDT leaf nodes fed into LR (or FFM), combine the effects of low-order and high-order features.

On Target Encoding:

import numpy as np
import pandas as pd

def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,    # Revised to encode validation series
                  val_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index
    ft_val_series = pd.merge(
        val_series.to_frame(val_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=val_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_val_series.index = val_series.index
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_val_series, noise_level), add_noise(ft_tst_series, noise_level)
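
A minimal usage sketch for the function above, assuming hypothetical DataFrames train, valid and test with a categorical column 'cat_col' and a binary 'target' column:

trn_enc, val_enc, tst_enc = target_encode(
    trn_series=train['cat_col'],
    val_series=valid['cat_col'],
    tst_series=test['cat_col'],
    target=train['target'],
    min_samples_leaf=100,
    smoothing=10,
    noise_level=0.01)

# Attach the encoded feature to each split.
train['cat_col_te'] = trn_enc
valid['cat_col_te'] = val_enc
test['cat_col_te'] = tst_enc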

On Bayesian smoothing:

import random
import numpy as np
import pandas as pd
import scipy.special as special

class HyperParam(object):
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta

    def sample_from_beta(self, alpha, beta, num, imp_upperbound):
        # Generate sample data
        sample = np.random.beta(alpha, beta, num)
        I = []
        C = []
        for click_ratio in sample:
            imp = random.random() * imp_upperbound
            # imp = imp_upperbound
            click = imp * click_ratio
            I.append(imp)
            C.append(click)
        return pd.Series(I), pd.Series(C)

    def update_from_data_by_FPI(self, tries, success, iter_num, epsilon):
        # Update strategy
        for i in range(iter_num):
            new_alpha, new_beta = self.__fixed_point_iteration(tries, success, self.alpha, self.beta)
            if abs(new_alpha - self.alpha) < epsilon and abs(new_beta - self.beta) < epsilon:
                break
            self.alpha = new_alpha
            self.beta = new_beta

    def __fixed_point_iteration(self, tries, success, alpha, beta):
        # One fixed-point iteration step
        sumfenzialpha = 0.0
        sumfenzibeta = 0.0
        sumfenmu = 0.0
        sumfenzialpha = (special.digamma(success + alpha) - special.digamma(alpha)).sum()
        sumfenzibeta = (special.digamma(tries - success + beta) - special.digamma(beta)).sum()
        sumfenmu = (special.digamma(tries + alpha + beta) - special.digamma(alpha + beta)).sum()
        return alpha * (sumfenzialpha / sumfenmu), beta * (sumfenzibeta / sumfenmu)
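
A minimal usage sketch, assuming hypothetical pd.Series of per-category impressions and clicks: fit alpha and beta by fixed-point iteration, then use them as pseudo-counts to smooth the raw click-through rate.

hp = HyperParam(1.0, 1.0)  # initial alpha and beta
hp.update_from_data_by_FPI(impressions, clicks, iter_num=1000, epsilon=1e-8)

# Smoothed rate: (clicks + alpha) / (impressions + alpha + beta)
smoothed_ctr = (clicks + hp.alpha) / (impressions + hp.alpha + hp.beta)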

III. Feature Engineering and Feature Selection

Train a GBDT or RF model and rank the training-set features by importance, from high to low.

  1. Build cross features directly

    Multiply or divide pairs of high-importance features.

  2. Split the features into two groups, use one group as the training inputs to predict, in turn, the value of each feature in the other group, and add the predictions as new features

  3. Feature Aggregation

    Compute cross statistics over high-importance features.
    Concretely, each time pick two high-importance features, use one as the group-by key, and compute the min/max, mean, median, variance, etc. of the other within each group, e.g. new_feature = features.groupby('feature1')['feature2'].mean() (see the sketch below).
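
A minimal sketch of both ideas, assuming a hypothetical DataFrame features whose two most important columns are 'feature1' and 'feature2':

import pandas as pd

def add_cross_and_agg_features(features):
    df = features.copy()

    # Cross features: multiply / divide two high-importance features.
    df['feature1_x_feature2'] = df['feature1'] * df['feature2']
    df['feature1_div_feature2'] = df['feature1'] / (df['feature2'] + 1e-8)

    # Feature aggregation: group by one feature, aggregate the other,
    # then map the group statistics back onto every row.
    agg = df.groupby('feature1')['feature2'].agg(['mean', 'median', 'max', 'min', 'std'])
    agg.columns = ['feature2_%s_by_feature1' % c for c in agg.columns]
    df = df.merge(agg, left_on='feature1', right_index=True, how='left')
    return df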

Common feature selection methods:

  1. Exhaustive search

    The advantage is that it is guaranteed to find the optimal subset of the full feature set; the drawback is the O(2^n) time complexity.

  2. Randomized search

    A heuristic: repeatedly sample a subset of the features, train on it, and iterate. Computationally much cheaper.

  3. mRMR feature selection (minimum Redundancy, Maximum Relevance); a greedy sketch follows
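
A minimal greedy mRMR sketch, assuming a feature DataFrame X and a classification target y; here mutual information is used for both the relevance and the redundancy terms, which is one common instantiation rather than the only one:

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, n_features=10):
    # Relevance of each feature to the target.
    relevance = pd.Series(mutual_info_classif(X, y), index=X.columns)
    selected, remaining = [], list(X.columns)

    while remaining and len(selected) < n_features:
        best_score, best_feat = -np.inf, None
        for f in remaining:
            # Redundancy: average MI between the candidate and already selected features.
            if selected:
                redundancy = np.mean([mutual_info_regression(X[[f]], X[s])[0] for s in selected])
            else:
                redundancy = 0.0
            score = relevance[f] - redundancy
            if score > best_score:
                best_score, best_feat = score, f
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected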

IV. XGBoost Parameter Tuning

  1. Initialize the parameters:

    eta = 0.1, max_depth = 10, subsample = 1.0, min_child_weight = 5,
    colsample_bytree = 0.5 (the fraction of features randomly sampled when constructing each tree; a sensible value depends on how many features the data has).
    Apart from eta = 0.1, the initial values of the other parameters depend on the specific problem.
    Choose a suitable objective and eval_metric. In xgboost.train(), the obj and feval arguments take a custom loss (objective) function and a custom evaluation function respectively; the maximize argument indicates whether the evaluation metric should be maximized.

  2. Hold out 20% of the data as a validation set, set a large num_rounds, and stop training when the validation error starts to rise.

    1. Tune max_depth
    2. Tune subsample - the subsample ratio of the training instances
    3. Tune min_child_weight
    4. Tune colsample_bytree - the subsample ratio of columns when constructing each tree
    5. Finally, lower eta to 0.02 and find the best num_rounds
  3. Use the parameters obtained from the steps above as a baseline, then make small adjustments on top of it so that the model gets as close as possible to a local optimum. A sketch of steps 1-2 follows.
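
A minimal sketch of the initialization and early stopping described above, assuming hypothetical arrays X_train, y_train, X_val, y_val and a binary classification task:

import xgboost as xgb

params = {
    'eta': 0.1,
    'max_depth': 10,
    'subsample': 1.0,
    'min_child_weight': 5,
    'colsample_bytree': 0.5,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Large num_rounds; training stops when the validation metric stops improving.
model = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=100,
    verbose_eval=50)

print(model.best_iteration, model.best_score)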

V. Ensembling

  1. Voting - take the result that receives the most votes

    1. Uniform

      A lower correlation between ensemble model members seems to result in an increase in the error-correcting capability.

    2. Weighted

      Give a better model more weight: the only way for the inferior models to overrule the best model (the expert) is for them to collectively (and confidently) agree on an alternative.

  2. Averaging - bagging-like, reduces overfitting

    1. averaging

      average the submissions from multiple models

    2. rank averaging

      First turn the predictions into ranks, then average these ranks.

  3. stacking and blending

    1. Stacked generalization

      The basic idea behind stacked generalization is to use a pool of base classifiers and then another classifier to combine their predictions, with the aim of reducing the generalization error.
      The ensemble methods above all combine the predictions of different models through a fixed formula or rule; stacking instead learns the combination with another algorithm (a classifier).

    2. Blending

      Instead of cross-validation, only a holdout set is used: the stacker model is trained on the base models' predictions on the holdout data.
      Pros and cons compared with stacking:
      Pros: simple and fast; no information leakage (e.g. with a 10% holdout, the first layer is trained on 90% of the data and the second layer on the remaining 10%).
      Cons: less training data is used, it is easy to overfit the holdout set, and the CV estimate is less reliable.

    3. Stacking with logistic regression (a minimal out-of-fold sketch is given at the end of this section)

    4. Stacking with non-linear algorithms

      Popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET.
      Non-linear stacking with the original features on multiclass problems gives surprising gains.

    5. Feature weighted linear stacking

      First, each model makes predictions from the engineered features; then a linear model learns which model is best for which kinds of samples, and the final prediction is formed as a weighted sum of the individual models' predictions.
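
A minimal out-of-fold sketch of stacking with logistic regression, assuming hypothetical numpy arrays X, y, X_test and a list of sklearn-style base classifiers; it illustrates the idea rather than any particular competition pipeline:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def stack_with_lr(base_models, X, y, X_test, n_splits=5):
    # Level-1 features are the out-of-fold predictions of each base model.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    oof_preds = np.zeros((len(X), len(base_models)))
    test_preds = np.zeros((len(X_test), len(base_models)))

    for m, model in enumerate(base_models):
        fold_test = np.zeros((len(X_test), n_splits))
        for k, (trn_idx, val_idx) in enumerate(kf.split(X)):
            model.fit(X[trn_idx], y[trn_idx])
            oof_preds[val_idx, m] = model.predict_proba(X[val_idx])[:, 1]
            fold_test[:, k] = model.predict_proba(X_test)[:, 1]
        test_preds[:, m] = fold_test.mean(axis=1)  # average the fold models on the test set

    # The stacker is trained on out-of-fold predictions to avoid leakage.
    stacker = LogisticRegression()
    stacker.fit(oof_preds, y)
    return stacker.predict_proba(test_preds)[:, 1]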

Reposted from: https://www.cnblogs.com/viredery/p/competition_how_to_deal_with_categorical_features.html
