This is an OCR-based meme search engine and desktop app. You can search cloud-hosted memes by keyword, upload local memes to be recognized and searched, and also crawl memes from Zhihu answers and upload them to the cloud so that even more memes become searchable — exploring the complex, contradictory and boundless web of human expressions. Let's chat happily together!
Cloud search: type the string you want into the search bar and click 云端搜索 (cloud search). Images whose recognized text contains that string are fetched from the cloud and displayed; click an image to copy it.
Zhihu crawler: open 知乎爬虫, search Zhihu for meme-related questions, copy the link of an answer and paste it to crawl every meme image posted by that answerer.
Upload and recognize: choose your own image directory and use 一键上传识别 (one-click upload & recognize). The images are uploaded to Tencent Cloud COS object storage, recognized in the cloud, and permanently added to the search pool, after which they can be found through cloud search.
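The upload step itself is not spelled out in this README. As a rough sketch only — the bucket name, region and credentials below are placeholders, and the actual upload code in main.py may differ — the Tencent COS Python SDK (cos-python-sdk-v5) can push one local meme like this:

# pip install cos-python-sdk-v5
from qcloud_cos import CosConfig, CosS3Client

config = CosConfig(Region="ap-shanghai",         # placeholder region
                   SecretId="YOUR_SECRET_ID",    # placeholder credentials
                   SecretKey="YOUR_SECRET_KEY")
client = CosS3Client(config)

# Upload a local image so the cloud-side OCR / search pool can pick it up
client.upload_file(Bucket="meme-bucket-125xxxxxxx",   # placeholder bucket name
                   LocalFilePath="memes/example.jpg",
                   Key="memes/example.jpg")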
Local search: click 本地识别搜索 to open the local window, then click 下载|更新 to download the cloud images. When more images are added to the cloud, the same button works out which images the cloud has that you do not and downloads only those. Clicking 本地识别 then sends the images to the cloud for recognition; the returned text is saved automatically, and you can already start searching while recognition is still running.
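Once the recognized text has been saved, the local search itself boils down to a substring match over those results. A toy sketch only — the real data structure and file layout used by main.py may be different:

# Toy sketch: OCR results kept as {image_path: recognized_text}
ocr_results = {
    "memes/001.jpg": "我裂开了",
    "memes/002.jpg": "奥利给",
}

def search(keyword):
    # return every image whose recognized text contains the keyword
    return [path for path, text in ocr_results.items() if keyword in text]

print(search("裂开"))   # -> ['memes/001.jpg']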
A small Windows tip: press Win+Q to open the search bar, type Mouse settings, then go to Additional mouse options → Pointer Options and tick "Automatically move pointer to the default button in a dialog box". The mouse will then jump to the OK button of every pop-up dialog, which saves quite a bit of time.
Store memes in the cloud
Recognize meme text and search it
Crawl memes from Zhihu
Recognize local memes
The recognition framework is PaddleOCR and the GUI is built with PySimpleGUI; see main.py for the details (documentation to be updated).
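main.py is not reproduced here. As a minimal sketch of the recognition step only — the image path and language options are assumptions, not the project's actual configuration, and the exact result layout varies a little between PaddleOCR versions:

# pip install paddleocr  (plus a paddlepaddle build suitable for your machine)
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="ch")   # Chinese text model with angle classification
result = ocr.ocr("memes/example.jpg", cls=True)  # hypothetical image path

# Each detected line carries a bounding box plus (text, confidence)
print(result)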
Requires Python 3.7. Run:
pip install -r requirements.txt
then run main.py.
Show the number of online users
Show which users are currently online
Added a 运气一下 (feeling lucky) button that fetches three random memes
Fixed the error raised when clicking a blank image
After setting up a deep-learning environment on Ubuntu, many people disable the X desktop service; or the Ubuntu box runs inside a virtual machine with no monitor and therefore no desktop at all, and is used from a Windows machine over SSH. With the X server disabled and the machine booting straight into console mode, you cannot adjust the fans through nvidia-settings, because nvidia-settings only works inside an X desktop session — forcing it just produces an error.
The fix is to trick the system into believing a monitor is attached — the so-called headless mode.
cd /opt
git clone https://github.com/boris-dimitrov/set_gpu_fans_public
# rename the directory
sudo mv set_gpu_fans_public set-gpu-fans
# create a symbolic link
ln -sf ~/set-gpu-fans /opt/set-gpu-fans
# start it
cd /opt/set-gpu-fans
sudo tcsh
./cool_gpu >& controller.log &
tail -f controller.log
ps -ef |grep X
Find the PID of the X process, then kill it:
kill -9 pid
After installing the NVIDIA driver, if no monitor is plugged in you will notice that no "/usr/lib/xorg/Xorg" process is running in the background, and nvidia-settings complains that it cannot connect to a GUI when you try to control the fan speed. One way around this is to fake a screen so that Xorg runs in the background. Create an edid.txt file:
sudo vim edid.txt
00 ff ff ff ff ff ff 00 1e 6d f5 56 71 ca 04 00 05 14 01 03 80 35 1e 78 0a ae c5 a2 57 4a 9c 25 12 50 54 21 08 00 b3 00 81 80 81 40 01 01 01 01 01 01 01 01 01 01 1a 36 80 a0 70 38 1f 40 30 20 35 00 13 2b 21 00 00 1a 02 3a 80 18 71 38 2d 40 58 2c 45 00 13 2b 21 00 00 1e 00 00 00 fd 00 38 3d 1e 53 0f 00 0a 20 20 20 20 20 20 00 00 00 fc 00 57 32 34 35 33 0a 20 20 20 20 20 20 20 01 3d 02 03 21 f1 4e 90 04 03 01 14 12 05 1f 10 13 00 00 00 00 23 09 07 07 83 01 00 00 65 03 0c 00 10 00 02 3a 80 18 71 38 2d 40 58 2c 45 00 13 2b 21 00 00 1e 01 1d 80 18 71 1c 16 20 58 2c 25 00 13 2b 21 00 00 9e 01 1d 00 72 51 d0 1e 20 6e 28 55 00 13 2b 21 00 00 1e 8c 0a d0 8a 20 e0 2d 10 10 3e 96 00 13 2b 21 00 00 18 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 26
Generate the xorg.conf and reboot:
sudo nvidia-xconfig -a --allow-empty-initial-configuration \
  --use-display-device="DFP-0" --connected-monitor="DFP-0" \
  --custom-edid="DFP-0:/home/$USER/edid.txt" --cool-bits=28
sudo reboot
After the reboot, nvidia-smi shows an Xorg process running and using only a tiny amount of VRAM. nvidia-settings can now control the fans and overclocking:
sudo DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -c :0 -a [gpu:GPUID]/GPUFanControlState=1
sudo DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -c :0 -a [fan:GPUID]/GPUTargetFanSpeed=70
Replace GPUID with the index of your card (0, 1, 2, 3, ...). To overclock:
sudo DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -c :0 -a [gpu:GPUID]/GPUGraphicsClockOffset[3]=64
sudo DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -c :0 -a [gpu:GPUID]/GPUMemoryTransferRateOffset[3]=500
For 20-series cards, change the [3] to [4]. Next, create a fan-control script:
sudo vim fan.sh
#!/bin/bash
headless=true
verbose=false
if [ "$headless" = true ] ; then
  export DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0
fi
# Enable user defined fan control for all gpu
nvidia-settings -a "GPUFanControlState=1"
while true
do
  # gpu index
  i=0
  # Get GPU temperature of all cards
  for gputemp in $(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader); do
    if [ "$verbose" = true ] ; then echo "gpu ${i} temp ${gputemp}"; fi
    # Note: you need to set the minimum fan speed to a non-zero value, or it won't work
    # This fan profile is being used in my GTX580 (Fermi). Change it as necessary
    # If temperature is between X to Y degrees, set fanspeed to Z value
    case "${gputemp}" in
      0[0-9]) newfanspeed="40" ;;
      1[0-9]) newfanspeed="40" ;;
      2[0-9]) newfanspeed="40" ;;
      3[0-9]) newfanspeed="40" ;;
      4[0-9]) newfanspeed="40" ;;
      5[0-4]) newfanspeed="50" ;;
      5[5-6]) newfanspeed="60" ;;
      5[7-9]) newfanspeed="70" ;;
      6[0-5]) newfanspeed="80" ;;
      6[6-9]) newfanspeed="90" ;;
      7[0-5]) newfanspeed="95" ;;
      7[6-9]) newfanspeed="98" ;;
      *) newfanspeed="98" ;;
    esac
    nvidia-settings -a "[fan-${i}]/GPUTargetFanSpeed=${newfanspeed}" 2>&1 >/dev/null
    if [ "$verbose" = true ] ; then echo "gpu ${i} new fanspeed ${newfanspeed}"; fi
    sleep 3s
    # increment gpu index
    i=$(($i+1))
  done
done
If your cards use multiple fan controllers you will need to adapt the script accordingly; for reference (blower-style) single-fan cards it generally works as-is no matter how many GPUs you have.
Run the fan script:
sudo ./fan.sh
This assumes the NVIDIA driver is already installed. Normally the Thermal Settings page of nvidia-settings offers no fan-speed control at all; the goal here is to expose the manual fan-speed option.
Create the directory:
sudo mkdir /etc/X11/xorg.conf.d/
Then create the file:
sudo vim /etc/X11/xorg.conf.d/nvidia.conf
If the file is empty, paste the following, save and quit; otherwise just add Option "Coolbits" "4" inside the existing Device section:
Section "Device"
    Option "Coolbits" "4"
EndSection
sudo reboot
After rebooting, open Thermal Settings again and the fan-speed control is now visible.
We want the video output to come from the HDMI or DP port on the eGPU enclosure (i.e. straight out of the card), because outputting through the card's own port avoids extra traffic between the GPU and the CPU.
First find your card's PCI address:
lspci
Look for this line:
03:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER] (rev a1)
Here 03 is the bus number, shown in hexadecimal; 03 happens to be the same value in decimal.
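The BusID in xorg.conf is written in decimal, so if your bus number contains letters (say 3b), convert it first. A trivial check, assuming a Python interpreter is at hand:

# lspci prints the bus number in hex; xorg.conf's BusID expects decimal.
bus_hex = "3b"            # example value; use whatever lspci shows for your card
print(int(bus_hex, 16))   # -> 59, so the BusID would be "PCI:59:0:0"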
Create the directory:
sudo mkdir /etc/X11/xorg.conf.d/
Then create the file:
sudo vim /etc/X11/xorg.conf.d/nvidia.conf
Paste the following into it, save and quit:
Section "Device"Identifier "Videocard0"BusID "PCI:03:0:0" # 把03换成你的地址Driver "nvidia"VendorName "NVIDIA Corporation"Option "AllowEmptyInitialConfiguration"Option "AllowExternalGpus"EndSection
Finally, edit the 10-nvidia.conf file:
sudo vim /usr/share/X11/xorg.conf.d/10-nvidia.conf
It originally looks like this:
# This xorg.conf.d configuration snippet configures the X server to
# automatically load the nvidia X driver when it detects a device driven by the
# nvidia-drm.ko kernel module. Please note that this only works on Linux kernels
# version 3.9 or higher with CONFIG_DRM enabled, and only if the nvidia-drm.ko
# kernel module is loaded before the X server is started.
Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
EndSection
Add Option "AllowExternalGpus" "true":
# This xorg.conf.d configuration snippet configures the X server to
# automatically load the nvidia X driver when it detects a device driven by the
# nvidia-drm.ko kernel module. Please note that this only works on Linux kernels
# version 3.9 or higher with CONFIG_DRM enabled, and only if the nvidia-drm.ko
# kernel module is loaded before the X server is started.
Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "AllowExternalGpus" "true"
EndSection
Press Esc, type :wq to save and quit, then reboot:
sudo reboot
Once the boot screen has passed, your external monitor will show a picture.
Downgrading gcc: Ubuntu 20.04 ships with gcc 9 (or with no gcc at all), and CUDA 10 does not support gcc-9, so install gcc-7 manually:
sudo apt-get install gcc-7 g++-7
Make gcc-7 the preferred version:
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 9
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 9
查看当前优先使用的版本:
sudo update-alternatives --display g++
Pytorch
深度学习环境与Nvidia
显卡Anaconda
Anaconda
推荐在清华源进行下载,我选择的是最新的Anaconda3-2020.11-Linux-x86_64.sh,注意版本。
复制相应的链接,然后输入下面命令:
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2020.11-Linux-x86_64.sh
下载完成后,输入:
sh Anaconda3-2020.11-Linux-x86_64.sh
Keep pressing Enter until the installer asks for an install path; accept the default by pressing Enter again. When it asks whether to add Anaconda3 to the environment variables, remember to type yes (I forgot, since the default is no); otherwise you will have to configure it afterwards by editing ~/.bashrc:
vim ~/.bashrc
and appending the following line, replacing excalibur with your own user name:
export PATH="/home/excalibur/anaconda3/bin:$PATH"
接下来验证是否安装成功:
source ~/.bashrc
python
Change the conda sources: open ~/.condarc
vim ~/.condarc
and add the following:
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
  - defaults
show_channel_urls: true
conda
创建新环境最好使用conda
创建一个新环境,在新环境中安装各种依赖库和使用代码。因此使用如下命令新建一个名称为“torch”的环境:
conda create --name torch python=3.8
可以使用source activate torch
来激活环境,可以使用source deactivate
来退出环境,让我们source activate torch
,此后安装pytorch
均在此环境中。
Ubuntu20.04
安装Nvidia
显卡驱动nouveau
为了避免sudo apt-get install nvidia-*
安装方式造成登录界面循环,安装nvidia
显卡驱动首先需要禁用nouveau
,不然会碰到冲突的问题,导致无法安装nvidia
显卡驱动。虽然说nouveau
貌似默认被禁用了,但是实战表明,只有通过手动禁用才能解决后续的一系列的问题。
编辑文件blacklist.conf
:
sudo gedit /etc/modprobe.d/blacklist.conf
在文件最后部分插入以下两行内容:
blacklist nouveauoptions nouveau modeset=0
更新系统:
sudo update-initramfs -u
重启:
sudo reboot
可以运行下列命令:
lsmod | grep nouveau
应该是没有输出的,虽然在禁用之前也是没有输出的。
Nvidia
驱动在“软件与更新”-“附加驱动”(Additional Drivers
)中,也能获取显卡驱动:
选择合适的驱动,apply
然后restart
即可。
在终端中输入:
nvidia-smi
可以看出支持低于11.1的任意CUDA
版本。
CUDA: since tensorflow-gpu 2.0.0 will be used later, install CUDA 10.0. On the NVIDIA site look for the 10.0 package and pick the 18.04 one (even though our system is 20.04). The runfile format is the most convenient; copy the download link and run:
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
连续回车看完协议,然后输入accept
,接下来会有一系列提示,提示是否安装显卡驱动那里选no
。(Graphics Driver)其他的均为yes
。默认的东西直接回车。
安装完毕后,配置环境变量:
vim ~/.bashrc
加入以下内容:
# add cuda
export PATH=/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
在终端输入:
source ~/.bashrc
刷新一下环境变量。然后检查CUDA
是否安装成功:
nvcc -V
出现信息即安装成功;
然后再编译测试:
cd ~/NVIDIA_CUDA-10.0_Samples/0_Simple/vectorAdd
make
sudo ./vectorAdd
提示测试通过。
cuDNN: download version 7.4.1 (registration and login required) — the cuDNN Library for Linux (x86_64). In the download directory run:
tar -zxf cudnn-10.0-linux-x64-v7.4.1.5.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/
Pytorch
激活之前创建的torch
环境:
source activate torch
安装:
conda install pytorch torchvision cudatoolkit=10.0
由于更换了conda
源,因此下载会很快。正是因为换源了,所以命令末尾没有-c pytorch
。
在终端:
python
>>> import torch
>>> import torchvision
>>> torch.cuda.is_available()
True
显示True
,证明pytorch
成功识别CUDA
。
Jupyter Notebook
服务介绍如何在自己本地浏览器使用jupyter notebook
连接远程服务器。
Ubuntu
服务器设置首先要在Ubuntu
服务器上的base
环境下安装Jupyter Notebook
:
pip install jupyter
生成Jupyter Notebook
配置文件:
jupyter notebook --generate-config
配置Jupyter Notebook
密码:
jupyter notebook password
配置jupyter_notebook_config.py
文件:
vim ~/.jupyter/jupyter_notebook_config.py
在最后一行后加入如下配置信息(vim
编辑器按A
键进行编辑):
c.NotebookApp.allow_remote_access = True
c.NotebookApp.open_browser = False
c.NotebookApp.ip = '*'
c.NotebookApp.allow_root = True
c.NotebookApp.port = 8888  # the port can be changed
添加完成后按ESC
,:wq
退出并保存,Ubuntu
服务器上的配置就完成了。
首先在Ubuntu
服务器上启动Jupyter Notebook
,这会保证服务一直开着,即使SSH
断开连接:
nohup jupyter notebook --no-browser --port=8889 --ip=127.0.0.1 &
然后在本地转发端口,用win+R
打开cmd
, 进入终端,输入:
ssh -N -f -L localhost:8888:localhost:8889 -p 22 remote_user@remote_host # 填写服务器用户名和IP
按照提示输入服务器密码即可,在本地浏览器网址栏输入http://127.0.0.1:8888,然后就可以看到jupyter-notebook
登录界面了。输入设置好的密码即可。
Jupyter Notebook
更换kernel
首先我们在anaconda3
的base
环境里,激活我们要用到的环境:
conda activate 环境名
然后安装ipykernel
:
conda install nb_conda_kernels
最后将环境写入Jupyter notebook
中的kernel
里:
python -m ipykernel install --user --name 环境名 --display-name "显示的名称"
完成。
Pycharm Pro
连接服务器运行程序Pycharm Pro
版本才具有此功能。首先打开项目文件如下:
Python Interpreter
File
—>Settings
—>Python Interpreter
,然后Add
建立新的解释器:点击SSH Interpreter
,建立新服务器连接,填入服务器IP
和用户名:
然后填入密码:111111
,下一步:
这里Interpreter
的默认路径:/usr/bin/python3
,链接的是基础的Anaconda3
自带的Python3
环境,如果想使用建立的Pytorch1.4.0
环境,则需要使用这个路径:/home/excalibur/anaconda3/envs/torch/bin/python
,这个/torch/
是我自己命名的,如果以后新建其他环境使用,整个路径只需要改变这个字段即可。然后勾上使用sudo
命令,然后还要修改一下本地代码部署到服务器的文件地址,为了养成良好的使用习惯,建议使用类似这样的地址:/pycharm_project/xxx
后面的xxx
就是你想要放的文件夹,最后Finish
,按照提示输入sudo
密码:111111
,然后Apply
,接着OK
。
此时可能会提示上传文件失败,permission denied,不用管。
Deployment
Tools
—>Deployment
—>Configuration...
点一下AUTODETECT
,这一步至关重要,然后再点Mappings
,主要看一下刚刚配置的Deployment Path
是否正确,然后OK
即可。
将鼠标放置在整个文件夹目录上,然后Tools
—>Deployment
—>Upload to ...
,就会把整个程序文件上传到你之前指定的目录那里。
等待上传完成。
然后Edit Configuration
。
确保Python Interpreter
是刚才设置的解释器。
直接点击绿色三角运行,或者右击文件run
运行即可,结果为:
Deployment
里面Download
到本地,Overwrite
本地的文件。当现有的环境中缺包时,先连接到服务器,工具有(Xshell
,FinalShell
,MobaXterm
)等。然后切换到那个环境,如切换到torch
环境:
conda activate torch
相应的,命令行前面的前缀会变成环境名称。
接着安装想要安装的包,如想要matplotlib
,直接:
pip install matplotlib
退出环境使用:
conda deactivate
当需要新的环境时,如需要建立名为tensorflow2.0.0
,Python
版本为3.8
的环境,直接:
conda create --name tensorflow2.0.0 python=3.8
至此,本文完结!✨✨✨
如果你的ESXi系统不是最新的,一般是无法识别RTX之类的显卡的,所以需要先下载补丁更新。
在 VMware Patch下载补丁,比如我的系统是ESXi6.7
,直接搜索即可,然后下载搜索结果第一个;
在ESXi6.7
通过文件管理,上传此文件;
在管理
—>服务
—>TSM-SSH
,开启SSH
服务,然后在主机
—>操作
—>进入维护模式
。
使用SSH
登录ESXi
,Windows可以直接在cmd
中运行:
ssh root@xx.xx.xx.xx # (服务器IP)
然后运行:
esxcli software vib install -d "/vmfs/volumes/datastore1/ESXi670-202011002.zip"
# /datastore1/ 根据你的数据存储(datastore)名字更改
# ESXi670-202011002.zip 根据你下载的文件名更改
补丁安装完后,重启服务器即可。
在管理
—>硬件
的PCI
设备中搜索nvidia
,一般会跳出四个结果,比如我这里有四个RTX 2070 Super
的设备,因为我已经直通过了,所以是活动状态。正常应该是灰色不可选的,此时只需勾选其中一个即可,勾选后会乱跳,应该是BUG不用管,直接点切换直通
即可;
重新引导主机;
Ubuntu
应用显卡选择你所要使用显卡的Ubuntu
系统,点击编辑
,然后添加PCI设备
,一共点四次,会出现后面的结果,并且记住后面的设备码。
有的时候,还需要设置一下内存预留,比如你的Ubuntu
虚拟机分配了8个G,就要预留8G内存;
还是在这个界面,在虚拟机选项
的高级
中,点击编辑
配置,然后添加参数
,键为:hypervisor.cpuid.v0
,值为FALSE
,即可;
在管理
的服务
中打开TSM-SSH
功能,然后Windows可以直接在cmd
中运行:
ssh root@xx.xx.xx.xx # (服务器IP)
连接后输入:
vi /etc/vmware/esx.conf
按下shift+G
直接跳转到最后,然后添加一段代码,这里的0000:3b:00.0
就是我们之前添加PCI设备
记下的设备码:
/device/0000:3b:00.0/owner = "passthru"
然后按下esc
,输入:wq
即可退出。
ESXi
重新引导。
进入Ubuntu系统,输入:
ubuntu-drivers devices
可以看到显卡设备型号,以及推荐的显卡驱动;
安装显卡驱动:
sudo apt-get install nvidia-driver-455
输入:
nvidia-smi
可以看到:
直通成功🎉🎉🎉。
连接腾讯云CVM,以及阿里云ECS可以用FinalShell或者Xshell,用Xshell的教程在这里保姆级教程——Xshell连接虚拟机中的Ubuntu并通过Xftp传输文件,连接本地Ubuntu和云端服务器步骤是一样的,只是ip输入的是公网ip。
在连接完毕后就可以进行后面的操作。
查看三个节点机器的IP地址:
VMware中Ubuntu18.04的IP,即运行
ifconfig
得到
这里的192.168.3.105
便是。
腾讯云CVM
需要用到的就是这里的公网ip:129.211.103.82
。
阿里云ECS
需要用到的就是这里的公网ip:47.96.189.80
。
IPFS
具体请看这篇文章:一文完全解决——Ubuntu20.04下源码构建安装IPFS环境
在最后运行:
ipfs init
注意一下输出信息:
也就是这里生成的.ipfs
文件在什么位置,不记得可以再运行一遍ipfs init
即可,这个位置后面要用到。
使用go-ipfs-swarm-key-gen工具来生成共享key。我准备把本地的VMware的Ubuntu作为主运行节点,所以在这台Ubuntu上运行如下命令:
# 编译工具
go get github.com/Kubuxu/go-ipfs-swarm-key-gen/ipfs-swarm-key-gen
cd $GOPATH
cd src/github.com/Kubuxu/go-ipfs-swarm-key-gen/ipfs-swarm-key-gen/
go build
# 生成key
./ipfs-swarm-key-gen > /home/excalibur/.ipfs/swarm.key
然后分别运行:
# 将本地生成的key拷贝到腾讯云服务器上的相同目录下(腾讯云公网ip为129.211.103.82)
scp /home/excalibur/.ipfs/swarm.key 129.211.103.82:/home/ubuntu/.ipfs/
# 将本地生成的key拷贝到阿里云服务器上的相同目录下
scp /home/excalibur/.ipfs/swarm.key 47.96.189.80:/root/.ipfs/
/home/excalibur/.ipfs/swarm.key
这里面的/home/excalibur/.ipfs/
是我的ipfs
配置文件夹,你应该根据自己的位置修改,也就是之前提到的那个目录。47.96.189.80:/root/.ipfs/
,这里面前面的ip
地址要根据你服务器的修改,并且后面的/root/.ipfs/
也要根据你服务器上的ipfs
文件夹修改,可以运行ipfs init
进行查看。如果遇到Permission denied的情况,就输入su进入管理员模式,重新运行上面两个scp命令。
删除默认的bootstrap节点:
ipfs bootstrap rm --all
查看本地节点的bootstrap信息:
ipfs id
得到:
我们需要这里的hash
值:QmTADgGT4MaCd3aTpD4vweGLQdWhr8oH8sue43DDioWBXA
,然后再加上之前的本地节点的ip地址:192.168.3.105
,就得到了所有需要的bootstrap
信息,然后分别在两台云服务器上运行如下命令:
ipfs bootstrap add /ip4/192.168.3.105/tcp/4001/ipfs/QmTADgGT4MaCd3aTpD4vweGLQdWhr8oH8sue43DDioWBXA
即可将本地节点作为它们的启动节点,自动加入ipfs
网络。
ipfs daemon
ipfs swarm peers
将看到其他网络内节点的运行信息,我这里是在本地Ubuntu上运行的命令,可以看到腾讯云服务器的节点信息,但是阿里云不在😅。
原因在于ECS的安全组设置:打开阿里云服务器设置,首先网络与安全组
,然后安全组配置
,然后配置规则
,手动添加
三个端口,分别是4001,5001,以及8080,最后ip地址可以是本地Ubuntu地址,或者直接设置成0.0.0.0/0
。
ipfs stats bitswap
可以看到
这里的partners
字段为1,说明当前网络有2个节点。
echo helloworld > hello.txt
ipfs add hello.txt
得到:
ipfs cat QmUU2HcUBVSXkfWPUc3WUSeCMrWWeEJTuAgR9uyWBhh9Nf
ipfs get QmUU2HcUBVSXkfWPUc3WUSeCMrWWeEJTuAgR9uyWBhh9Nf
可见,就很纳爱斯!😁😁😁
如果是新装的Ubuntu系统,运行sudo命令输入密码可能会不成功,所以需要先运行:
sudo passwd
重置密码,即可。
Go
语言Go
IPFS
是基于Go
语言的项目,环境要求go version 1.14+。在Go
的官方网站下载最新的版本即可https://golang.org/dl/。
可以用以下命令:
wget https://golang.org/dl/go1.14.6.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.14.6.linux-amd64.tar.gz
Tips:
wget
失败可以到官网https://golang.org/dl/go1.14.6.linux-amd64.tar.gz下载镜像,然后在那个目录下打开终端执行上面的命令的第二句。Xftp
连接虚拟机,将文件拖过去,至于如何连接,请看这篇文章保姆级教程——Xshell连接虚拟机中的Ubuntu并通过Xftp传输文件,Xshell
和Xftp
连接过程相同。Go
环境go
的文件夹,在go
的文件夹中建立三个子目录(名字必须为src
、pkg
和bin
)。创建目录过程如下:
cd ~
mkdir go
cd go
sudo mkdir src
sudo mkdir pkg
sudo mkdir bin
sudo chmod 777 src
sudo chmod 777 pkg
sudo chmod 777 bin
ls -l
vi ~/.profile
export PATH=$PATH:/usr/local/go/bin
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$PATH:$HOME/go/bin
然后按Esc
退出,接着输入:wq
,然后输入回车就可以保存退出。
source ~/.profile
go version
go env
~/.bashrc
中也设置一下:
gedit ~/.bashrc
然后在最后面添加:
export PATH=$PATH:/usr/local/go/bin
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$PATH:$HOME/go/bin
最后再:
source ~/.bashrc
apt-get
并安装 git
terminal
执行以下语句:
sudo apt-get update
sudo apt-get install git
go-ipfs
源码因为
go get
国内基本上下载不了,加上镜像的话例如:
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
go get -u github.com/ipfs/go-ipfs
虽然可以很快地下载,但却下载到了
/go/pkg/mod/
的目录下,感觉很难受,所以不推荐这种下载方法。
git clone
的方法,但是如果直接clone
的是github
上的源码还是很慢,所以我采取的方法是,先将源码fork
到自己的仓库,然后再导入到码云,然后再从码云上clone
下来,速度简直快的飞起,可以直接用我的码云上的源码库,版本为ipfs 0.6.0
,操作如下:
cd ~
cd go/src
mkdir github.com
cd github.com
mkdir ipfs
cd ipfs
git clone https://gitee.com/ExcaliburAias/go-ipfs.git
当然,也不用非得clone
到go/src/github.com/ipfs/go-ipfs
下面,直接clone
到桌面也可以。
go-ipfs
源码· 首先安装make
工具,然后安装gcc
,最后授予文件权限以及更改go get
的源,操作如下:
cd ~
cd go/src/github.com/ipfs/go-ipfs
sudo apt update
sudo apt install make
sudo apt install build-essential
sudo chmod 777 /usr/local/go/bin
sudo chmod 777 /plugin/loader/preload.go
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
make install
· 测试:
ipfs version
最后建议设置回去,也就是:
go env -w GO111MODULE=off
install
,直接build
,即生成的ipfs.exe
不加入系统环境,而是生成在go/src/github.com/ipfs/go-ipfs/cmd/ipfs/ipfs.exe
这里。实现方法就是将最后的:make install
改为
make build
IPFS
的初始化和连接IPFS
节点:ipfs init
ipfs cat /ipfs/QmQPeNsJPyVWPFDVHb77w8G42Fvo15z4bG2X8D2GhfbSXc/readme
查看已经存储的readme文件
ipfs daemon
虚拟机——> 设置 ——>网络适配器——> 桥接模式
首先安装net-tools
(如已安装忽略),然后用ifconfig
查看ip地址。命令如下:
sudo apt install net-tools
ifconfig
下图红框中就是我们后面需要的ip地址:
ssh
状态首先检测ssh
的状态:
ps -e | grep ssh
没有看到sshd
就说明未启动,选择下面的一种方式手动启动就好了:
sudo service sshd start
sudo /etc/init.d/ssh start
如果没安装,就直接安装,安装完毕会自动启动:
sudo apt install openssh-server
Xshell
进行连接inet
后面的ip
地址:who
我这里,就是:excalibur
,密码就是sudo
的密码。
右键连接,用Xftp
打开,可以通过Xftp
进行文件传输:
直接把左面的主机的文件往虚拟机里面拖就行了。
目前地铁一般采用如下的单一交路:
目前,我国绝大多数城市都采用这种交路形式,但是当断面客流量分布不均匀时容易造成线路运能浪费,客流拥挤。
替代方案就是用大小交路:
使用遗传算法程序就是在既定的OD矩阵下找到最优的大小交路的往返站$S_a,S_b$以及相应的大小交路的发车频率$f_1,f_2$,也就是在遗传算法每次运行中,根据不同的大小交路折返站的设置,划分预定的OD出行矩阵,然后计算目标函数,判断是否达到最优。
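The original equation images did not survive extraction. As a reconstruction that is consistent with the objective function coded in m_Fx.m further below (the notation here is mine, not necessarily the thesis's original symbols), the GA minimizes the total passenger waiting time

$$
\min Z = Q_1\cdot\frac{30}{f_1} + Q_2\cdot\frac{30}{f_1+f_2},
$$

where $Q_2=\sum_{i=S_a}^{S_b}\sum_{j=S_a}^{S_b}OD_{ij}$ is the demand whose origin and destination both lie inside the short-turn section $[S_a,S_b]$ and is therefore served by both routes, $Q_1$ is the remaining demand served only by the full-length route, and $30/f$ is the average wait in minutes (half the headway) at a service frequency of $f$ trains per hour.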
MATLAB
程序所有的程序以及数据,OD出行矩阵以及区间运行时间在:列车交路方案优化遗传算法程序
myself.m
——主脚本主脚本,OD
矩阵可以使用od.m
脚本随机生成。
clear;clc;close all;%% 生成随机OD矩阵%od()%%遗传参数设置NUMPOP=200;%初始种群大小irange_l=1; %问题解区间irange_r=35; LENGTH=24; %二进制编码长度ITERATION = 10000;%迭代次数CROSSOVERRATE = 0.8;%杂交率SELECTRATE = 0.4;%选择率VARIATIONRATE = 0.2;%变异率OD = xlsread('OD.xlsx');% 苏州地铁2号线调查问卷OD出行矩阵h = xlsread('区间运行时间.xlsx'); % 苏州地铁2号线区间长度及运行时分%初始化种群pop=m_InitPop(NUMPOP,irange_l,irange_r);pop_save=pop;fitness_concat = [];best_solution = [];%开始迭代for time=1:ITERATION %计算初始种群的适应度 fitness=m_Fitness(pop, OD, h); fitness_concat = [fitness_concat;max(fitness)]; pop_T = pop'; [m,index] = max(m_Fitness(pop, OD, h)); best_solution = [best_solution;pop(:,index)']; %选择 pop=m_Select(fitness,pop,SELECTRATE); %编码 binpop=m_Coding(pop,LENGTH,irange_l); %交叉 kidsPop = crossover(binpop,NUMPOP,CROSSOVERRATE); %变异 kidsPop = Variation(kidsPop,VARIATIONRATE); %解码 kidsPop=m_Incoding(kidsPop,irange_l); %更新种群 pop=[pop kidsPop];enddisp(['最优解:' num2str(min(m_Fx(pop,OD))) '分钟']);disp(['最优解对应的各参数:' num2str(pop(1,1)) ',' num2str(pop(2,1)) ',' num2str(pop(3,1)) ',' num2str(pop(4,1)) ]);disp(['最大适应度:' num2str(max(m_Fitness(pop, OD, h)))]); figure% set(gca,'looseInset',[0 0 0 0]);set(gcf,'outerposition',get(0,'screensize'));loglog(1:ITERATION, fitness_concat, 'Blue*-','linewidth',2)legend('{\bf最优适应度值}');xlabel('{\bf进化代数}','fontsize',30);ylabel('{\bf最优适应度}','fontsize',30);set(gca,'FontSize',20,'Fontname', 'Times New Roman');set(get(gca,'XLabel'),'Fontsize',20,'Fontname', '宋体');set(get(gca,'YLabel'),'Fontsize',20,'Fontname', '宋体');set(get(gca,'legend'),'Fontsize',20,'Fontname', '宋体');set(get(gca,'title'),'Fontsize',20,'Fontname', '宋体');set(gca,'linewidth',2); print(gcf,'-dpng','-r300','最优适应度值-进化代数');figure% set(gca,'looseInset',[0 0 0 0]);set(gcf,'outerposition',get(0,'screensize'));semilogx(1 : ITERATION, best_solution,'linewidth',4)legend('{\bf大小交路折返站a}','{\bf大小交路折返站b}','{\bf大交路发车频率f_1}','{\bf小交路发车频率f_2}');% text(6, 0.3, '$\leftarrow y= 2^{-x}$', 'HorizontalAlignment', 'left', 'Interpreter', 'latex', 'FontSize', 15);xlabel('{\bf进化代数}','fontsize',15);ylabel('{\bf参数各代最优值}','fontsize',15);set(gca,'FontSize',20,'Fontname', 'Times New Roman');set(get(gca,'XLabel'),'Fontsize',20,'Fontname', '宋体');set(get(gca,'YLabel'),'Fontsize',20,'Fontname', '宋体');set(get(gca,'legend'),'Fontsize',20,'Fontname', '宋体');set(get(gca,'title'),'Fontsize',20,'Fontname', '宋体');set(gca,'linewidth',2); print(gcf,'-dpng','-r300','参数各代最优值-进化代数');
od.m
——生成随机出行OD
矩阵用来生成随机出行OD
矩阵。
Mu = 26;sigma = 10;N = round(normrnd(Mu, sigma, [35 35]));N = N + abs(min(N));sum(sum(N))if sum(sum(N)) > 35000 ; if sum(sum(N)) < 40000; xlswrite('test.xlsx',N,'Sheet1') endend
m_InitPop.m
——初始化种群function pop=m_InitPop(numpop,irange_l,irange_r)%% 初始化种群% 输入:numpop--种群大小;% [irange_l,irange_r]--初始种群所在的区间pop=[];for j = 1:numpop for i=1:4 % 因为a,b,f1,f2要求整数,所以生成随机整数 pop(i,j)= round(irange_l+(irange_r-irange_l)*rand); endend
m_Select.m
——选择function parentPop=m_Select(matrixFitness,pop,SELECTRATE)%% 选择% 输入:matrixFitness--适应度矩阵% pop--初始种群% SELECTRATE--选择率sumFitness=sum(matrixFitness(:));%计算所有种群的适应度accP=cumsum(matrixFitness/sumFitness);%累积概率%轮盘赌选择算法for n=1:round(SELECTRATE*size(pop,2)) matrix=find(accP>rand); %找到比随机数大的累积概率 if isempty(matrix) continue end parentPop(:,n)=pop(:,matrix(1));%将首个比随机数大的累积概率的位置的个体遗传下去endend
Crossover.m
——交叉%% 子函数%%题 目:Crossover%%%%输 入:% parentsPop 上一代种群% NUMPOP 种群大小% CROSSOVERRATE 交叉率%输 出:% kidsPop 下一代种群%%% function kidsPop = Crossover(parentsPop,NUMPOP,CROSSOVERRATE)kidsPop = {[]};n = 1;while size(kidsPop,2)<NUMPOP-size(parentsPop,2) %选择出交叉的父代和母代 father = parentsPop{1,ceil((size(parentsPop,2)-1)*rand)+1}; mother = parentsPop{1,ceil((size(parentsPop,2)-1)*rand)+1}; %随机产生交叉位置 crossLocation = ceil((length(father)-1)*rand)+1; %如果随即数比交叉率低,就杂交 if rand<CROSSOVERRATE father(1,crossLocation:end) = mother(1,crossLocation:end); kidsPop{n} = father; n = n+1; endend
Variation.m
——变异%% 子函数%%题 目:Variation%%%输 入:% pop 种群% VARIATIONRATE 变异率%输 出:% pop 变异后的种群%% function kidsPop = Variation(kidsPop,VARIATIONRATE)for n=1:size(kidsPop,2) if rand<VARIATIONRATE temp = kidsPop{n}; %找到变异位置 location = ceil(length(temp)*rand); temp = [temp(1:location-1) num2str(~temp(location))... temp(location+1:end)]; kidsPop{n} = temp; endend
m_Coding.m
——编码因为总共$35$座车站,$a,b,f_1,f_2$都不超过$35<2^6$,所以$4$个参数都设置为$6$位二进制,这样编码总长度为$24$。
function binPop=m_Coding(pop,pop_length,irange_l)%% 二进制编码(生成染色体)% 输入:pop--种群% pop_length--编码长度for n=1:size(pop,2) %列循环 binPop{n} = ''; for k=1:size(pop,1) %行循环 substr = dec2bin(pop(k,n)); lengthpop = length(substr); for s = 1:6-lengthpop substr = ['0' substr]; end binPop{n} = [binPop{n} substr]; endend
m_Incoding.m
——解码解码时编码长度为$24$,每隔$6$位转化成十进制。
function pop=m_Incoding(binPop,irange_l)%% 解码popNum=1;popNum = 4;%染色体包含的参数数量for n=1:size(binPop,2) % 因为有35个车站,35<2^6 ,所以编码为6位 pop(1,n) = bin2dec(binPop{1,n}(1:6)); pop(2,n) = bin2dec(binPop{1,n}(7:12)); pop(3,n) = bin2dec(binPop{1,n}(13:18)); pop(4,n) = bin2dec(binPop{1,n}(19:24));end% pop = pop./10^6+irange_l;
m_Fitness.m
——适应度函数(重要,实现约束条件)在这里实现约束条件,思路就是不满足约束条件的种群的适应度设置为无穷小,那么在下一代的迭代中就会将适应度低的种群淘汰掉,实现约束的目的。
function fitness=m_Fitness(pop, OD, h)%% Fitness Functionfor n=1:size(pop,2) a = pop(1,n); b = pop(2,n); f1 = pop(3,n); f2 = pop(4,n);%% 约束条件,不满足约束则适应度值无穷小 %% 1) a,b,f1,f2 不能为0 if a == 0 || b == 0 || f1 == 0 || f2 == 0 fitness(n) = 1/1000000000; continue; end %% 2) a,b,f1,f2 不能超过35 if a > 35 || b > 35 || f1 >35 || f2 >35 fitness(n) = 1/1000000000; continue; end%% 3) 列车数量约束 if (sum(h) * 120 + 1170) *( f1 - 16) + (sum(h(a: b-1)) + (b - a + 1) * 30 + 120) * f2 > 0 fitness(n) = 1/1000000000; continue; end%% 4) 满载率约束 % constraint2 = [];% for j = 2:33% constraint2(j) = (sum(sum(OD(1:j, j+1:35)))/(f1+f2)) * (sum(sum(OD(j+1:35,1:j)))/(f1+f2));% end% if max(constraint2) > 1 * 1460% fitness(n) = 1/1000000000;% continue;% end%% 5) 最小追踪间隔 if f1 + f2 > 30 fitness(n) = 1/1000000000; continue; end %% 5) 最小发车间隔 if f1 < 12 fitness(n) = 1/1000000000; continue; end%% 主要适应度函数,设置为目标函数的倒数,即目标函数要求最小,那么越小,适应度就越大 fitness(n)= 1/m_Fx(pop(:,n), OD);end
m_Fx.m
——目标函数(重要)
function y=m_Fx(x, OD)
%% 要求解的函数
%% Z = Q1 * t1d + Q2 * t2d
    y = (sum(sum(OD)) - sum(sum(OD(x(1):x(2),x(1):x(2))))) * (30/x(3)) + sum(sum(OD(x(1):x(2),x(1):x(2)))) * (30/(x(3)+x(4)));
end
最优解:71335.4762分钟最优解对应的各参数:4,32,14,4最大适应度:1.4018e-05
即设置第$4$和第$32$个站点为大小交路折返站,大交路发车频率为$14$列/小时,小交路发车频率为$4$列/小时,对应的乘客总候车时间最低,约为$71335$分钟。
图像结果:
Dinar
的页面;import timeimport reimport osimport jsonimport requestsimport pandas as pdfrom bs4 import BeautifulSoupfrom alive_progress import alive_barurl_index_prefix = "http://example.webscraping.com/places/default/index/"url_info_prefix = "http://example.webscraping.com"inverted_index_dict = {}length = 252header = { 'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 'Accept-Encoding': "gzip, deflate", 'Accept-Language': "zh-CN,zh;q=0.9,en;q=0.8", 'Cache-Control': "max-age=0", 'Cookie': "session_id_places=True; session_data_places=\"586ad5c755d830e432c6e80f8b9a822a:xLrqTGkuTTFaRdOtQTpde-UcgSMy7nwOrXyEeyRafNjWT8t7J\ bHjZGf1cYO6bcnIVhwOHVNpJiMnr32rtSCF2_RSOUfBX4gRmU09KTNfMczD2vc4aaloPAvNE6gLStboj-EBBnFkWhVP3uCd8woSyXnTQwYi39HKoujz4iX1tJA5O4dr7z3VCn22mvev_\ MZaNSW4TT1jTJUZoF_3hyqtoN8rTL_Mjpu02ACJscaG6lRfQmIOBZ-BloR7aT4s-it19e0JYkbpynKb-an8f72IRhiN-thhyXeYbo6SCX0LzAra6Il1zM4Zpw9GkQFU2yha", 'Host': "example.webscraping.com", 'Proxy-Connection': "keep-alive", 'Referer': "http://example.webscraping.com/places/default/index/1", 'Upgrade-Insecure-Requests': "1", 'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}def SplitList2Words(WordList): words = [] for phrase in WordList: phrase = str(phrase) if " " in phrase: for word in phrase.split(" "): if word != '>': # del '>' from words words.append(word) elif "." in phrase: words.append(phrase.split(".")[1]) # del '.' from words elif "," in phrase: for word in phrase.split(","): # del ',' from words words.append(word) else: words.append(phrase) return wordsdef GenerateInvertedIndex(words, inverted_index_dict, url): for w in words: if w in inverted_index_dict.keys(): if url in inverted_index_dict[w].keys(): # add to the number of occurences inverted_index_dict[w][url] += 1 else: # add this page link to the word dict value inverted_index_dict[w][url] = 1 else: # did not encounter this word before inverted_index_dict[w] = {url: 1} js = json.dumps(inverted_index_dict) file = open('InvertedIndex.txt', 'w') file.write(js) file.close() return inverted_index_dictdef build(): inverted_index_dict = {} with alive_bar(length) as bar: for page in range(26): header['Referer'] = url_index_prefix while True: try: # Set timeout to 10 seconds r = requests.get(url_index_prefix + str(page), headers=header, timeout=10) break except: print("TimeOut Error, reconnecting...") time.sleep(2) soup_main = BeautifulSoup(r.text, 'lxml') main_word_list = [str(w.text).strip() for w in soup_main.find_all('a')] main_word_list.append(str(soup_main.h1.text).strip()) words = SplitList2Words(main_word_list) inverted_index_dict = GenerateInvertedIndex( words, inverted_index_dict, url_index_prefix + str(page)) url_suffix = re.findall(r'/places/default/view/[A-Za-z]+\S+[A-Za-z]+[-]+[0-9]+', r.text) print("Start to crawl page %d !" 
% (page)) time.sleep(2) for info_url in url_suffix: bar() info_word_list = [] crawl_url = url_info_prefix + info_url header['Referer'] = url_index_prefix + str(page) while True: try: r_info = requests.get(crawl_url, headers=header, timeout=10) break except: print("TimeOut Error, reconnecting...") time.sleep(2) soup_info = BeautifulSoup(r_info.text, 'lxml') info_word_list = [str(w.text).strip() for w in soup_info.find_all('a')] info_word_list.append(str(soup_info.h1.text).strip()) title_info = SplitList2Words(info_word_list) inverted_index_dict = GenerateInvertedIndex( title_info, inverted_index_dict, crawl_url) country = pd.read_html(r_info.text) country_title = [x.split(":")[0] for x in country[0][0]] country_info = [info for info in country[0][1]] inverted_index_dict = GenerateInvertedIndex( country_title, inverted_index_dict, crawl_url) country_info = SplitList2Words(country_info) inverted_index_dict = GenerateInvertedIndex( country_info, inverted_index_dict, crawl_url) print("Country \"%s\" was crawled!" % (country[0][1][4])) time.sleep(5) print("Finished crawling page %d" % (page))def load(): file = open('InvertedIndex.txt', 'r') js = file.read() dic = json.loads(js) file.close() return dicif __name__ == "__main__": while True: command = input() if command == 'build': build() elif command == 'load': if not os.path.exists('InvertedIndex.txt'): print("Can not find the Inverted Index File, please 'build' first!") else: inverted_index_dict = load() print("Load file 'InvertedIndex.txt' successfully!") elif command.split(" ")[0] == 'print': try: beautiful_format = json.dumps(inverted_index_dict[command.split(" ")[1]], indent=4, ensure_ascii=False) print(beautiful_format) except: print("The index \'%s\' doesn't exist!" % (command.split(" ")[1])) elif command.split(" ")[0] == 'find': try: if len(command.split(" ")) == 2: for i in list(inverted_index_dict[command.split(" ")[1]].keys()): print(i) except: print("The index \'%s\' doesn't exist!" % (command[5:])) else: try: url_list = [] for word in command.split(" ")[1:]: url_list.append(list(inverted_index_dict[word].keys())) for i in set(url_list[0]).intersection(*url_list[1:]): print(i) except: print("The intersection of \'%s\' doesn't exist!" % (command[5:])) else: print("Please input the right command!")
本章思维导图:
这是一道来自于天池的新手练习题目,用数据分析
、机器学习
等手段进行 二手车售卖价格预测 的回归问题。赛题本身的思路清晰明了,即对给定的数据集进行分析探讨,然后设计模型运用数据进行训练,测试模型,最终给出选手的预测结果。前面我们已经进行过EDA分析在这里天池_二手车价格预测_Task1-2_赛题理解与数据分析
以及天池_二手车价格预测_Task3_特征工程
赛题官方给出了来自Ebay Kleinanzeigen的二手车交易记录,总数据量超过40w,包含31列变量信息,其中15列为匿名变量,即v0
至v14
。并从中抽取15万条作为训练集,5万条作为测试集A,5万条作为测试集B,同时对name
、model
、brand
和regionCode
等信息进行脱敏。具体的数据表如下图:
Field | Description |
---|---|
SaleID | 交易ID,唯一编码 |
name | 汽车交易名称,已脱敏 |
regDate | 汽车注册日期,例如20160101,2016年01月01日 |
model | 车型编码,已脱敏 |
brand | 汽车品牌,已脱敏 |
bodyType | 车身类型:豪华轿车:0,微型车:1,厢型车:2,大巴车:3,敞篷车:4,双门汽车:5,商务车:6,搅拌车:7 |
fuelType | 燃油类型:汽油:0,柴油:1,液化石油气:2,天然气:3,混合动力:4,其他:5,电动:6 |
gearbox | 变速箱:手动:0,自动:1 |
power | 发动机功率:范围 [ 0, 600 ] |
kilometer | 汽车已行驶公里,单位万km |
notRepairedDamage | 汽车有尚未修复的损坏:是:0,否:1 |
regionCode | 地区编码,已脱敏 |
seller | 销售方:个体:0,非个体:1 |
offerType | 报价类型:提供:0,请求:1 |
creatDate | 汽车上线时间,即开始售卖时间 |
price | 二手车交易价格(预测目标) |
v系列特征 | 匿名特征,包含v0-14在内15个匿名特征 |
为了后面处理数据提高性能,所以需要对其进行内存优化。
import pandas as pdimport numpy as npimport warningswarnings.filterwarnings('ignore')
def reduce_mem_usage(df): """ 迭代dataframe的所有列,修改数据类型来减少内存的占用 """ start_mem = df.memory_usage().sum() print('Memory usage of dataframe is {:.2f} MB'.format(start_mem)) for col in df.columns: col_type = df[col].dtype if col_type != object: c_min = df[col].min() c_max = df[col].max() if str(col_type)[:3] == 'int': # 判断可以用哪种整型就可以表示,就转换到那个整型去 if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max: df[col] = df[col].astype(np.int8) elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max: df[col] = df[col].astype(np.int16) elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max: df[col] = df[col].astype(np.int32) elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max: df[col] = df[col].astype(np.int64) else: if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max: df[col] = df[col].astype(np.float16) elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max: df[col] = df[col].astype(np.float32) else: df[col] = df[col].astype(np.float64) else: df[col] = df[col].astype('category') end_mem = df.memory_usage().sum() print('Memory usage after optimization is: {:.2f} MB'.format(end_mem)) print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem)) return df
sample_feature = reduce_mem_usage(pd.read_csv('../excel/data_for_tree.csv'))
Memory usage of dataframe is 35249888.00 MBMemory usage after optimization is: 8925652.00 MBDecreased by 74.7%
continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model']]
sample_feature.head()
SaleID | name | model | brand | bodyType | fuelType | gearbox | power | kilometer | notRepairedDamage | ... | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | power_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | - | ... | 4756.0 | 4.0 | 4940.0 | 9504.0 | 3000.0 | 149.0 | 17934852.0 | 2538.0 | 3630.0 | NaN |
1 | 5 | 137642 | 24.0 | 10 | 0.0 | 1.0 | 0.0 | 109 | 10.0 | 0.0 | ... | 2482.0 | 3.0 | 3556.0 | 9504.0 | 2490.0 | 200.0 | 10936962.0 | 2180.0 | 3074.0 | 10.0 |
2 | 7 | 165346 | 26.0 | 14 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | 0.0 | ... | 6108.0 | 4.0 | 8784.0 | 9504.0 | 1350.0 | 13.0 | 17445064.0 | 1798.0 | 1986.0 | 10.0 |
3 | 10 | 18961 | 19.0 | 9 | 3.0 | 1.0 | 0.0 | 101 | 15.0 | 0.0 | ... | 3874.0 | 1.0 | 4488.0 | 9504.0 | 1250.0 | 55.0 | 7867901.0 | 1557.0 | 1753.0 | 10.0 |
4 | 13 | 8129 | 65.0 | 1 | 0.0 | 0.0 | 0.0 | 150 | 15.0 | 1.0 | ... | 4152.0 | 3.0 | 4940.0 | 9504.0 | 3000.0 | 149.0 | 17934852.0 | 2538.0 | 3630.0 | 14.0 |
5 rows × 39 columns
continuous_feature_names
['SaleID', 'name', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time', 'city', 'brand_amount', 'brand_price_max', 'brand_price_median', 'brand_price_min', 'brand_price_sum', 'brand_price_std', 'brand_price_average', 'power_bin']
设置训练集的自变量train_X
与因变量train_y
sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True)sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32)train = sample_feature[continuous_feature_names + ['price']]train_X = train[continuous_feature_names]train_y = train['price']
sklearn.linear_model
库调用线性回归函数from sklearn.linear_model import LinearRegression
训练模型。normalize设置为True时,输入特征会在回归前逐列做归一化:减去均值后再除以其L2范数,即
$$
X' = \frac{X-\bar{X}}{\lVert X-\bar{X}\rVert_2}
$$
model = LinearRegression(normalize=True)model = model.fit(train_X, train_y)
查看训练的线性回归模型的截距(intercept)与权重(coef),其中zip
先将特征与权重拼成元组,再用dict.items()
将元组变成列表,lambda
里面取元组的第2个元素,也就是按照权重排序。
print('intercept:'+ str(model.intercept_))sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
intercept:-74792.9734982533[('v_6', 1409712.605060366), ('v_8', 610234.5713666412), ('v_2', 14000.150601494915), ('v_10', 11566.15879987477), ('v_7', 4359.400479384727), ('v_3', 734.1594753553514), ('v_13', 429.31597053081543), ('v_14', 113.51097451363385), ('bodyType', 53.59225499923475), ('fuelType', 28.70033988480179), ('power', 14.063521207625223), ('city', 11.214497244626225), ('brand_price_std', 0.26064581249034796), ('brand_price_median', 0.2236946027016186), ('brand_price_min', 0.14223892840381142), ('brand_price_max', 0.06288317241689621), ('brand_amount', 0.031481415743174694), ('name', 2.866003063271253e-05), ('SaleID', 1.5357186544049832e-05), ('gearbox', 8.527422323822975e-07), ('train', -3.026798367500305e-08), ('offerType', -2.0873267203569412e-07), ('seller', -8.426140993833542e-07), ('brand_price_sum', -4.1644253886318015e-06), ('brand_price_average', -0.10601622599106471), ('used_time', -0.11019174518618283), ('power_bin', -64.74445582883024), ('kilometer', -122.96508938774225), ('v_0', -317.8572907738245), ('notRepairedDamage', -412.1984812088826), ('v_4', -1239.4804712396635), ('v_1', -2389.3641453624136), ('v_12', -12326.513672033445), ('v_11', -16921.982011390297), ('v_5', -25554.951071390704), ('v_9', -26077.95662717417)]
长尾分布是尾巴很长的分布。那么尾巴很长很厚的分布有什么特殊的呢?有两方面:一方面,这种分布会使得你的采样不准,估值不准,因为尾部占了很大部分。另一方面,尾部的数据少,人们对它的了解就少,那么如果它是有害的,那么它的破坏力就非常大,因为人们对它的预防措施和经验比较少。实际上,在稳定分布家族中,除了正态分布,其他均为长尾分布。
随机找个特征,用随机下标选取一定的数观测预测值与真实值之间的差别
from matplotlib import pyplot as pltsubsample_index = np.random.randint(low=0, high=len(train_y), size=50)plt.scatter(train_X['v_6'][subsample_index], train_y[subsample_index], color='black')plt.scatter(train_X['v_6'][subsample_index], model.predict(train_X.loc[subsample_index]), color='red')plt.xlabel('v_6')plt.ylabel('price')plt.legend(['True Price','Predicted Price'],loc='upper right')print('真实价格与预测价格差距过大!')plt.show()
真实价格与预测价格差距过大!<Figure size 640x480 with 1 Axes>
绘制特征v_6
的值与标签的散点图,图片发现模型的预测结果(红色点)与真实标签(黑色点)的分布差异较大,且部分预测值出现了小于0的情况,说明我们的模型存在一些问题。
下面可以通过作图我们看看数据的标签(price
)的分布情况
import seaborn as snsplt.figure(figsize=(15,5))plt.subplot(1,2,1)sns.distplot(train_y)plt.subplot(1,2,2)sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])# 去掉尾部10%的数再画一次,依然是呈现长尾分布
<matplotlib.axes._subplots.AxesSubplot at 0x210469a20f0>
从这两个频率分布直方图来看,price
呈现长尾分布,不利于我们的建模预测,原因是很多模型都假设数据误差项符合正态分布,而长尾分布的数据违背了这一假设。
在这里我们对train_y
进行了$log(x+1)$变换,使标签贴近于正态分布
train_y_ln = np.log(train_y + 1)plt.figure(figsize=(15,5))plt.subplot(1,2,1)sns.distplot(train_y_ln)plt.subplot(1,2,2)sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])
<matplotlib.axes._subplots.AxesSubplot at 0x21046aa7588>
可以看出经过对数处理后,长尾分布的效果减弱了。再进行一次线性回归:
model = model.fit(train_X, train_y_ln)print('intercept:'+ str(model.intercept_))sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)
intercept:22.237755141260187[('v_1', 5.669305855573455), ('v_5', 4.244663233260515), ('v_12', 1.2018270333465797), ('v_13', 1.1021805892566767), ('v_10', 0.9251453991435046), ('v_2', 0.8276319426702504), ('v_9', 0.6011701859510072), ('v_3', 0.4096252333799574), ('v_0', 0.08579322268709569), ('power_bin', 0.013581489882378468), ('bodyType', 0.007405158753814581), ('power', 0.0003639122482301998), ('brand_price_median', 0.0001295023112073966), ('brand_price_max', 5.681812615719255e-05), ('brand_price_std', 4.2637652140444604e-05), ('brand_price_sum', 2.215129563552113e-09), ('gearbox', 7.094911325111752e-10), ('seller', 2.715054847612919e-10), ('offerType', 1.0291500984749291e-10), ('train', -2.2282620193436742e-11), ('SaleID', -3.7349069125800904e-09), ('name', -6.100613320903764e-08), ('brand_amount', -1.63362003323235e-07), ('used_time', -2.9274637535648837e-05), ('brand_price_min', -2.97497751376125e-05), ('brand_price_average', -0.0001181124521449396), ('fuelType', -0.0018817210167693563), ('city', -0.003633315365347111), ('v_14', -0.02594698320698149), ('kilometer', -0.03327227857575015), ('notRepairedDamage', -0.27571086049472), ('v_4', -0.6724689959780609), ('v_7', -1.178076244244115), ('v_11', -1.3234586342526309), ('v_8', -83.08615946716786), ('v_6', -315.0380673447196)]
再一次画出预测与真实值的散点对比图:
plt.scatter(train_X['v_6'][subsample_index], train_y[subsample_index], color='black')plt.scatter(train_X['v_6'][subsample_index], np.exp(model.predict(train_X.loc[subsample_index])), color='blue')plt.xlabel('v_6')plt.ylabel('price')plt.legend(['True Price','Predicted Price'],loc='upper right')plt.show()
效果稍微好了一点,但毕竟是线性回归,拟合得还是不够好。
cross_val_score
) 在使用训练集对参数进行训练的时候,经常会发现人们通常会将一整个训练集分为三个部分(比如mnist手写训练集)。一般分为:训练集(train_set
),评估集(valid_set
),测试集(test_set
)这三个部分。这其实是为了保证训练效果而特意设置的。其中测试集很好理解,其实就是完全不参与训练的数据,仅仅用来观测测试效果的数据。而训练集和评估集则牵涉到下面的知识了。
因为在实际的训练中,训练的结果对于训练集的拟合程度通常还是挺好的(初始条件敏感),但是对于训练集之外的数据的拟合程度通常就不那么令人满意了。因此我们通常并不会把所有的数据集都拿来训练,而是分出一部分来(这一部分不参加训练)对训练集生成的参数进行测试,相对客观的判断这些参数对训练集之外的数据的符合程度。这种思想就称为交叉验证(Cross Validation
)。
直观的类比就是训练集是上课,评估集是平时的作业,而测试集是最后的期末考试。😏
Cross Validation
:简言之,就是进行多次train_test_split
划分;每次划分时,在不同的数据集上进行训练、测试评估,从而得出一个评价结果;如果是5折交叉验证,意思就是在原始数据集上,进行5次划分,每次划分进行一次训练、评估,最后得到5次划分后的评估结果,一般在这几次评估结果上取平均得到最后的评分。k-fold cross-validation
,其中,k
一般取5或10。
一般情况将K折交叉验证用于模型调优,找到使得模型泛化性能最优的超参值。找到后,在全部训练集上重新训练模型,并使用独立测试集对模型性能做出最终评价。K折交叉验证使用了无重复抽样技术的好处:每次迭代过程中每个样本点只有一次被划入训练集或测试集的机会。
更多参考资料:几种交叉验证(cross validation)方式的比较
、k折交叉验证
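As a minimal, self-contained illustration of the splitting idea only (toy data, not the competition data), sklearn's KFold hands out train/validation indices like this:

# Toy illustration of 5-fold splitting (unrelated to the used-car dataset)
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # every sample lands in exactly one validation fold
    print("fold %d: train=%s, val=%s" % (fold, train_idx, val_idx))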
sklearn.model_selection
的cross_val_score
进行交叉验证from sklearn.model_selection import cross_val_scorefrom sklearn.metrics import mean_absolute_error, make_scorer
cross_val_score
相应函数的应用def log_transfer(func): def wrapper(y, yhat): result = func(np.log(y), np.nan_to_num(np.log(yhat))) return result return wrapper
上面的log_transfer
是提供装饰器功能,是为了将下面的cross_val_score
的make_scorer
的mean_absolute_error
(它的公式在下面)的输入参数做对数处理,其中np.nan_to_num
顺便将nan
转变为0。
$$
MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|
$$
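A quick numeric sanity check of this formula (toy numbers, unrelated to the dataset):

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
# (|3-2.5| + |5-5| + |2.5-4|) / 3 = 2.0 / 3 ≈ 0.667
print(mean_absolute_error(y_true, y_pred))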
cross_val_score
是sklearn
用于交叉验证评估分数的函数,前面几个参数很明朗,后面几个参数需要解释一下。
verbose
:详细程度,也就是是否输出进度信息cv
:交叉验证生成器或可迭代的次数scoring
:调用用来评价的方法,是score越大约好,还是loss越小越好,默认是loss。这里调用了mean_absolute_error
,只是在调用之前先进行了log_transfer
的装饰,然后调用的y
和yhat
,会自动将cross_val_score
得到的X
和y
代入。make_scorer
:构建一个完整的定制scorer函数,可选参数greater_is_better
,默认为False
,也就是loss越小越好下面是对未进行对数处理的原特征数据进行五折交叉验证
scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.2s finished
print('AVG:', np.mean(scores))
AVG: 0.7533845471636889
scores = pd.DataFrame(scores.reshape(1,-1)) # 转化成一行,(-1,1)为一列scores.columns = ['cv' + str(x) for x in range(1, 6)]scores.index = ['MAE']scores
cv1 | cv2 | cv3 | cv4 | cv5 | |
---|---|---|---|---|---|
MAE | 0.727867 | 0.759451 | 0.781238 | 0.750681 | 0.747686 |
使用线性回归模型,对进行过对数处理的原特征数据进行五折交叉验证
scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished
print('AVG:', np.mean(scores))
AVG: 0.2124134663602803
scores = pd.DataFrame(scores.reshape(1,-1))scores.columns = ['cv' + str(x) for x in range(1, 6)]scores.index = ['MAE']scores
cv1 | cv2 | cv3 | cv4 | cv5 | |
---|---|---|---|---|---|
MAE | 0.208238 | 0.212408 | 0.215933 | 0.210742 | 0.214747 |
可以看出进行对数处理后,五折交叉验证的loss显著降低。
例如:通过2018年的二手车价格预测2017年的二手车价格,这显然是不合理的,因此我们还可以采用时间顺序对数据集进行分隔。在本例中,我们选用靠前时间的4/5样本当作训练集,靠后时间的1/5当作验证集,最终结果与五折交叉验证差距不大。
import datetimesample_feature = sample_feature.reset_index(drop=True)split_point = len(sample_feature) // 5 * 4train = sample_feature.loc[:split_point].dropna()val = sample_feature.loc[split_point:].dropna()train_X = train[continuous_feature_names]train_y_ln = np.log(train['price'])val_X = val[continuous_feature_names]val_y_ln = np.log(val['price'])
model = model.fit(train_X, train_y_ln)
mean_absolute_error(val_y_ln, model.predict(val_X))
0.21498301182417004
学习曲线是一种用来判断训练模型的一种方法,它会自动把训练样本的数量按照预定的规则逐渐增加,然后画出不同训练样本数量时的模型准确度。
我们可以把$J_{train}(\theta)$和$J_{cv}(\theta)$作为纵坐标,画出它们与训练集大小$m$的关系,这就是学习曲线。通过学习曲线,可以直观地观察到模型的准确性和训练数据大小的关系。 我们可以比较直观的了解到我们的模型处于一个什么样的状态,如:过拟合(overfitting)或欠拟合(underfitting)
如果数据集的大小为$m$,则通过下面的流程即可画出学习曲线:
1.把数据集分成训练数据集和交叉验证集(可以看作测试集);
2.取训练数据集的20%作为训练样本,训练出模型参数;
3.使用交叉验证集来计算训练出来的模型的准确性;
4.以训练集的score和交叉验证集score为纵坐标(这里的score取决于你使用的make_score
方法,例如MAE),训练集的个数作为横坐标,在坐标轴上画出上述步骤计算出来的模型准确性;
5.训练数据集增加10%,调到步骤2,继续执行,知道训练数据集大小为100%。
learning_curve()
:这个函数主要是用来判断(可视化)模型是否过拟合的。下面是一些参数的解释:
X
:是一个m*n的矩阵,m:数据数量,n:特征数量;y
:是一个m*1的矩阵,m:数据数量,相对于X
的目标进行分类或回归;groups
:将数据集拆分为训练/测试集时使用的样本的标签分组。[可选];train_sizes
:指定训练样品数量的变化规则。比如:np.linspace(0.1, 1.0, 5)表示把训练样品数量从0.1-1分成5等分,生成[0.1, 0.325,0.55,0.75,1]的序列,从序列中取出训练样品数量百分比,逐个计算在当前训练样本数量情况下训练出来的模型准确性。cv
:None
,要使用默认的三折交叉验证(v0.22版本中将改为五折);n_jobs
:要并行运行的作业数。None表示1。 -1表示使用所有处理器;pre_dispatch
:并行执行的预调度作业数(默认为全部)。该选项可以减少分配的内存。该字符串可以是“ 2 * n_jobs”之类的表达式;shuffle
:bool
,是否在基于train_sizes
为前缀之前对训练数据进行洗牌;from sklearn.model_selection import learning_curve, validation_curve
plt.fill_between()
用来填充两条线间区域,其他好像没什么好解释的了。
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )): plt.figure() plt.title(title) if ylim is not None: plt.ylim(*ylim) plt.xlabel('Training example') plt.ylabel('score') train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error)) train_scores_mean = np.mean(train_scores, axis=1) train_scores_std = np.std(train_scores, axis=1) test_scores_mean = np.mean(test_scores, axis=1) test_scores_std = np.std(test_scores, axis=1) plt.grid()#区域 plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r") plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g") plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label="Training score") plt.plot(train_sizes, test_scores_mean,'o-',color="g", label="Cross-validation score") plt.legend(loc="best") return plt
plot_learning_curve(LinearRegression(), 'Liner_model', train_X[:], train_y_ln[:], ylim=(0.0, 0.5), cv=5, n_jobs=-1)
<module 'matplotlib.pyplot' from 'D:\\Software\\Anaconda\\lib\\site-packages\\matplotlib\\pyplot.py'>
训练误差与验证误差逐渐一致,准确率也挺高(这里的score是MAE,所以是loss趋近于0.2,准确率趋近于0.8),但是训练误差几乎没变过,所以属于过拟合。这里给出一下高偏差欠拟合(bias)以及高方差过拟合(variance)的模样:
更形象一点:
Data:
Normal fitting:
overfitting:
serious overfitting:
train = sample_feature[continuous_feature_names + ['price']].dropna()train_X = train[continuous_feature_names]train_y = train['price']train_y_ln = np.log(train_y + 1)
有一些前叙知识需要补全。其中关于正则化的知识:
更多其他知识可以看这篇文章:机器学习中正则化项L1和L2的直观理解
在过滤式和包裹式特征选择方法中,特征选择过程与学习器训练过程有明显的分别。而嵌入式特征选择在学习器训练过程中自动地进行特征选择。嵌入式选择最常用的是L1正则化与L2正则化。在对线性回归模型加入两种正则化方法后,他们分别变成了岭回归与Lasso回归。
LinearRegression
,Ridge
,Lasso
方法的运行from sklearn.linear_model import LinearRegressionfrom sklearn.linear_model import Ridgefrom sklearn.linear_model import Lasso
models = [LinearRegression(), Ridge(), Lasso()]
result = dict()for model in models: model_name = str(model).split('(')[0] scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)) result[model_name] = scores print(model_name + ' is finished')
LinearRegression is finishedRidge is finishedD:\Software\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)D:\Software\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)D:\Software\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)D:\Software\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)Lasso is finishedD:\Software\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)
result = pd.DataFrame(result)result.index = ['cv' + str(x) for x in range(1, 6)]result
LinearRegression | Ridge | Lasso | |
---|---|---|---|
cv1 | 0.208238 | 0.213319 | 0.394868 |
cv2 | 0.212408 | 0.216857 | 0.387564 |
cv3 | 0.215933 | 0.220840 | 0.402278 |
cv4 | 0.210742 | 0.215001 | 0.396664 |
cv5 | 0.214747 | 0.220031 | 0.397400 |
1.纯LinearRegression
方法的情况:.intercept_
是截距(与y轴的交点)即$\theta_0$,.coef_
是模型的斜率即$\theta_1 - \theta_n$
model = LinearRegression().fit(train_X, train_y_ln)print('intercept:'+ str(model.intercept_)) # 截距(与y轴的交点)sns.barplot(abs(model.coef_), continuous_feature_names)
intercept:22.23769348625359<matplotlib.axes._subplots.AxesSubplot at 0x210418e4d68>
纯LinearRegression
回归可以发现,得到的参数列表是比较稀疏的。
model.coef_
array([-3.73489972e-09, -6.10060860e-08, 7.40515349e-03, -1.88182450e-03, -1.24570527e-04, 3.63911807e-04, -3.32722751e-02, -2.75710825e-01, -1.43048695e-03, -3.28514719e-03, 8.57926933e-02, 5.66930260e+00, 8.27635812e-01, 4.09620867e-01, -6.72467882e-01, 4.24497013e+00, -3.15038152e+02, -1.17801777e+00, -8.30861129e+01, 6.01215351e-01, 9.25141289e-01, -1.32345773e+00, 1.20182089e+00, 1.10218030e+00, -2.59470516e-02, 8.88178420e-13, -2.92746484e-05, -3.63331132e-03, -1.63354329e-07, 5.68181101e-05, 1.29502381e-04, -2.97497182e-05, 2.21512681e-09, 4.26377388e-05, -1.18112552e-04, 1.35814944e-02])
2.Lasso
方法即L1正则化的情况:
model = Lasso().fit(train_X, train_y_ln)print('intercept:'+ str(model.intercept_))sns.barplot(abs(model.coef_), continuous_feature_names)
intercept:7.946156528722565D:\Software\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:492: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)<matplotlib.axes._subplots.AxesSubplot at 0x210405debe0>
L1正则化有助于生成一个稀疏权值矩阵,进而可以用于特征选择。如上图,我们发现power与userd_time特征非常重要。
3.Ridge
方法即L2正则化的情况:
model = Ridge().fit(train_X, train_y_ln)print('intercept:'+ str(model.intercept_))sns.barplot(abs(model.coef_), continuous_feature_names)
intercept:2.7820015512913994<matplotlib.axes._subplots.AxesSubplot at 0x2103fdd99b0>
从上图可以看到,各参数普遍被压得比较小,但几乎没有恰好为0的。
原因在于L2正则化在拟合过程中通常都倾向于让权值尽可能小,最后构造一个所有参数都比较小的模型。因为一般认为参数值小的模型比较简单,能适应不同的数据集,也在一定程度上避免了过拟合现象。
可以设想一下对于一个线性回归方程,若参数很大,那么只要数据偏移一点点,就会对结果造成很大的影响;但如果参数足够小,数据偏移得多一点也不会对结果造成什么影响,专业一点的说法是『抗扰动能力强』
除此之外,决策树通过信息熵或GINI指数选择分裂节点时,优先选择的分裂特征也更加重要,这同样是一种特征选择的方法。XGBoost与LightGBM模型中的model_importance指标正是基于此计算的
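As a minimal sketch of that idea (assuming the train_X / train_y_ln / continuous_feature_names defined earlier; feature_importances_ reports LightGBM's default split-based importance):

# Rough sketch: rank features by LightGBM's built-in importance
import pandas as pd
from lightgbm.sklearn import LGBMRegressor

lgb_model = LGBMRegressor(n_estimators=100).fit(train_X, train_y_ln)
importance = pd.Series(lgb_model.feature_importances_, index=continuous_feature_names)
print(importance.sort_values(ascending=False).head(10))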
支持向量机,决策树,随机森林,梯度提升树(GBDT),多层感知机(MLP),XGBoost,LightGBM等
from sklearn.linear_model import LinearRegressionfrom sklearn.svm import SVCfrom sklearn.tree import DecisionTreeRegressorfrom sklearn.ensemble import RandomForestRegressorfrom sklearn.ensemble import GradientBoostingRegressorfrom sklearn.neural_network import MLPRegressorfrom xgboost.sklearn import XGBRegressorfrom lightgbm.sklearn import LGBMRegressor
定义模型集合
models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor(), MLPRegressor(solver='lbfgs', max_iter=100), XGBRegressor(n_estimators = 100, objective='reg:squarederror'), LGBMRegressor(n_estimators = 100)]
用数据一一对模型进行训练
result = dict()for model in models: model_name = str(model).split('(')[0] scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)) result[model_name] = scores print(model_name + ' is finished')
LinearRegression is finishedDecisionTreeRegressor is finishedRandomForestRegressor is finishedGradientBoostingRegressor is finishedMLPRegressor is finishedXGBRegressor is finishedLGBMRegressor is finished
result = pd.DataFrame(result)result.index = ['cv' + str(x) for x in range(1, 6)]result
LinearRegression | DecisionTreeRegressor | RandomForestRegressor | GradientBoostingRegressor | MLPRegressor | XGBRegressor | LGBMRegressor | |
---|---|---|---|---|---|---|---|
cv1 | 0.208238 | 0.224863 | 0.163196 | 0.179385 | 581.596878 | 0.155881 | 0.153942 |
cv2 | 0.212408 | 0.218795 | 0.164292 | 0.183759 | 182.180288 | 0.158566 | 0.160262 |
cv3 | 0.215933 | 0.216482 | 0.164849 | 0.185005 | 250.668763 | 0.158520 | 0.159943 |
cv4 | 0.210742 | 0.220903 | 0.160878 | 0.181660 | 139.101476 | 0.156608 | 0.157528 |
cv5 | 0.214747 | 0.226087 | 0.164713 | 0.183704 | 108.664261 | 0.173250 | 0.157149 |
可以看到随机森林模型在每一个fold中均取得了更好的效果
np.mean(result['RandomForestRegressor'])
0.16358568277026037
三种常用的调参方法如下:
贪心算法 https://www.jianshu.com/p/ab89df9759c8
网格调参 https://blog.csdn.net/weixin_43172660/article/details/83032029
贝叶斯调参 https://blog.csdn.net/linxid/article/details/81189154
## LGB的参数集合:
objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']
num_leaves = [3, 5, 10, 15, 20, 40, 55]
max_depth = [3, 5, 10, 15, 20, 40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []
best_obj = dict()for obj in objective: model = LGBMRegressor(objective=obj) score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))) best_obj[obj] = score best_leaves = dict()for leaves in num_leaves: model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves) score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))) best_leaves[leaves] = score best_depth = dict()for depth in max_depth: model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0], max_depth=depth) score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))) best_depth[depth] = score
sns.lineplot(x=['0_initial','1_turning_obj','2_turning_leaves','3_turning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])
<matplotlib.axes._subplots.AxesSubplot at 0x21041776128>
from sklearn.model_selection import GridSearchCV
parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}model = LGBMRegressor()clf = GridSearchCV(model, parameters, cv=5)clf = clf.fit(train_X, train_y)
clf.best_params_
{'max_depth': 10, 'num_leaves': 55, 'objective': 'regression'}
model = LGBMRegressor(objective='regression', num_leaves=55, max_depth=10)
np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
0.1526351038235066
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple bayesian-optimizationfrom bayes_opt import BayesianOptimization
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simpleCollecting bayesian-optimization Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b5/26/9842333adbb8f17bcb3d699400a8b1ccde0af0b6de8d07224e183728acdf/bayesian_optimization-1.1.0-py3-none-any.whlRequirement already satisfied: scikit-learn>=0.18.0 in d:\software\anaconda\lib\site-packages (from bayesian-optimization) (0.20.3)Requirement already satisfied: scipy>=0.14.0 in d:\software\anaconda\lib\site-packages (from bayesian-optimization) (1.2.1)Requirement already satisfied: numpy>=1.9.0 in d:\software\anaconda\lib\site-packages (from bayesian-optimization) (1.16.2)Installing collected packages: bayesian-optimizationSuccessfully installed bayesian-optimization-1.1.0
def rf_cv(num_leaves, max_depth, subsample, min_child_samples): val = cross_val_score( LGBMRegressor(objective = 'regression_l1', num_leaves=int(num_leaves), max_depth=int(max_depth), subsample = subsample, min_child_samples = int(min_child_samples) ), X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error) ).mean() return 1 - val # 贝叶斯调参目标是求最大值,所以用1减去误差
rf_bo = BayesianOptimization( rf_cv, { 'num_leaves': (2, 100), 'max_depth': (2, 100), 'subsample': (0.1, 1), 'min_child_samples' : (2, 100) })
rf_bo.maximize()
| iter | target | max_depth | min_ch... | num_le... | subsample |-------------------------------------------------------------------------| [0m 1 [0m | [0m 0.8493 [0m | [0m 80.61 [0m | [0m 97.58 [0m | [0m 44.92 [0m | [0m 0.881 [0m || [95m 2 [0m | [95m 0.8514 [0m | [95m 35.87 [0m | [95m 66.92 [0m | [95m 57.68 [0m | [95m 0.7878 [0m || [95m 3 [0m | [95m 0.8522 [0m | [95m 49.75 [0m | [95m 68.95 [0m | [95m 64.99 [0m | [95m 0.1726 [0m || [0m 4 [0m | [0m 0.8504 [0m | [0m 35.58 [0m | [0m 10.83 [0m | [0m 53.8 [0m | [0m 0.1306 [0m || [0m 5 [0m | [0m 0.7942 [0m | [0m 63.37 [0m | [0m 32.21 [0m | [0m 3.143 [0m | [0m 0.4555 [0m || [0m 6 [0m | [0m 0.7997 [0m | [0m 2.437 [0m | [0m 4.362 [0m | [0m 97.26 [0m | [0m 0.9957 [0m || [95m 7 [0m | [95m 0.8526 [0m | [95m 47.85 [0m | [95m 69.39 [0m | [95m 68.02 [0m | [95m 0.8833 [0m || [95m 8 [0m | [95m 0.8537 [0m | [95m 96.87 [0m | [95m 4.285 [0m | [95m 99.53 [0m | [95m 0.9389 [0m || [95m 9 [0m | [95m 0.8546 [0m | [95m 96.06 [0m | [95m 97.85 [0m | [95m 98.82 [0m | [95m 0.8874 [0m || [0m 10 [0m | [0m 0.7942 [0m | [0m 8.165 [0m | [0m 99.06 [0m | [0m 3.93 [0m | [0m 0.2049 [0m || [0m 11 [0m | [0m 0.7993 [0m | [0m 2.77 [0m | [0m 99.47 [0m | [0m 91.16 [0m | [0m 0.2523 [0m || [0m 12 [0m | [0m 0.852 [0m | [0m 99.3 [0m | [0m 43.04 [0m | [0m 62.67 [0m | [0m 0.9897 [0m || [0m 13 [0m | [0m 0.8507 [0m | [0m 96.57 [0m | [0m 2.749 [0m | [0m 55.2 [0m | [0m 0.6727 [0m || [0m 14 [0m | [0m 0.8168 [0m | [0m 3.076 [0m | [0m 3.269 [0m | [0m 33.78 [0m | [0m 0.5982 [0m || [0m 15 [0m | [0m 0.8527 [0m | [0m 71.88 [0m | [0m 7.624 [0m | [0m 76.49 [0m | [0m 0.9536 [0m || [0m 16 [0m | [0m 0.8528 [0m | [0m 99.44 [0m | [0m 99.28 [0m | [0m 69.58 [0m | [0m 0.7682 [0m || [0m 17 [0m | [0m 0.8543 [0m | [0m 99.93 [0m | [0m 45.95 [0m | [0m 97.54 [0m | [0m 0.5095 [0m || [0m 18 [0m | [0m 0.8518 [0m | [0m 60.87 [0m | [0m 99.67 [0m | [0m 61.3 [0m | [0m 0.7369 [0m || [0m 19 [0m | [0m 0.8535 [0m | [0m 99.69 [0m | [0m 16.58 [0m | [0m 84.31 [0m | [0m 0.1025 [0m || [0m 20 [0m | [0m 0.8507 [0m | [0m 54.68 [0m | [0m 38.11 [0m | [0m 54.65 [0m | [0m 0.9796 [0m || [0m 21 [0m | [0m 0.8538 [0m | [0m 99.1 [0m | [0m 81.79 [0m | [0m 84.03 [0m | [0m 0.9823 [0m || [0m 22 [0m | [0m 0.8529 [0m | [0m 99.28 [0m | [0m 3.373 [0m | [0m 83.48 [0m | [0m 0.7243 [0m || [0m 23 [0m | [0m 0.8512 [0m | [0m 52.67 [0m | [0m 2.614 [0m | [0m 59.65 [0m | [0m 0.5286 [0m || [95m 24 [0m | [95m 0.8546 [0m | [95m 75.81 [0m | [95m 61.62 [0m | [95m 99.78 [0m | [95m 0.9956 [0m || [0m 25 [0m | [0m 0.853 [0m | [0m 45.9 [0m | [0m 33.68 [0m | [0m 74.59 [0m | [0m 0.73 [0m || [0m 26 [0m | [0m 0.8532 [0m | [0m 82.58 [0m | [0m 63.9 [0m | [0m 78.61 [0m | [0m 0.1014 [0m || [0m 27 [0m | [0m 0.8544 [0m | [0m 76.15 [0m | [0m 97.58 [0m | [0m 95.07 [0m | [0m 0.9995 [0m || [0m 28 [0m | [0m 0.8545 [0m | [0m 95.75 [0m | [0m 74.96 [0m | [0m 99.45 [0m | [0m 0.7263 [0m || [0m 29 [0m | [0m 0.8532 [0m | [0m 80.84 [0m | [0m 89.28 [0m | [0m 77.31 [0m | [0m 0.9389 [0m || [0m 30 [0m | [0m 0.8545 [0m | [0m 82.92 [0m | [0m 35.46 [0m | [0m 96.66 [0m | [0m 0.969 [0m |=========================================================================
rf_bo.max
{'target': 0.8545792238909576, 'params': {'max_depth': 75.80893509302794, 'min_child_samples': 61.62267920507557, 'num_leaves': 99.77501502667806, 'subsample': 0.9955706357612557}}
1 - rf_bo.max['target']
0.14542077610904236
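拿到最优参数后,可以把整数型参数取整,重新训练一个最终模型。下面只是一个示意写法,`train_X`、`train_y_ln` 沿用前文的变量名:

best = rf_bo.max['params']
model_lgb_opt = LGBMRegressor(objective='regression_l1',
                              num_leaves=int(best['num_leaves']),
                              max_depth=int(best['max_depth']),
                              subsample=best['subsample'],
                              min_child_samples=int(best['min_child_samples']))
model_lgb_opt.fit(train_X, train_y_ln)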
在本章中,我们完成了建模与调参的工作,并对我们的模型进行了验证。此外,我们还采用了一些基本方法来提高预测的精度,提升如下图所示。
plt.figure(figsize=(13, 5))
sns.lineplot(x=['0_origin', '1_log_transfer', '2_L1_&_L2', '3_change_model', '4_parameter_turning'],
             y=[1.36, 0.19, 0.19, 0.16, 0.15])
<matplotlib.axes._subplots.AxesSubplot at 0x21041688208>
]]>
又到了毕业季,学弟学妹们开始了毕设之旅,提到毕设想到了什么呢?对,没错,必备技巧就是绘制各种精美绝伦,举世无双的高清美图。这不,我刚炖了碗鲜美的极坐标热力图气象图汤。😢
如下:
数据可以是随机产生,或者放在 `csv` 文件中读取。在 `csv` 中存储格式如下:
| pos | 0 | 30 | 60 | 90 |
|-----|--------------|--------------|--------------|--------------|
| 0 | 1.101447148 | 1.308827831 | 1.526038083 | 1.603848713 |
| 30 | 1.101447148 | 1.279591136 | 1.49432297 | 1.577829862 |
| 60 | 1.101447148 | 1.204513965 | 1.435064241 | 1.52576792 |
| 90 | 1.101447148 | 1.108569817 | 1.404547306 | 1.499676995 |
| 120 | 1.101447148 | 1.204513965 | 1.435064241 | 1.52576792 |
| 150 | 1.101447148 | 1.279591136 | 1.49432297 | 1.577829862 |
| 180 | 1.101447148 | 1.308827831 | 1.526038083 | 1.603848713 |
| 210 | 1.101447148 | 1.279591136 | 1.49432297 | 1.577829862 |
| 240 | 1.101447148 | 1.204513965 | 1.435064241 | 1.52576792 |
| 270 | 1.101447148 | 1.108569817 | 1.404547306 | 1.499676995 |
| 300 | 1.101447148 | 1.204513965 | 1.435064241 | 1.52576792 |
| 330 | 1.101447148 | 1.279591136 | 1.49432297 | 1.577829862 |
| 360 | 1.101447148 | 1.308827831 | 1.526038083 | 1.603848713 |
因为要绘制的是极坐标图,所以行名(`pos` 列中的 0~360)代表的是角度(代码中会换算成弧度),而列名(0、30、60、90)代表的就是半径。
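用一个极简的示意说明行、列到 $(\theta, r)$ 的对应关系(只演示角度换算,变量名是随意取的):

import numpy as np

theta = np.radians([0, 30, 60, 90, 120])  # 行名 pos:角度(度)换算成弧度
r = np.array([0, 30, 60, 90])             # 列名:半径
print(theta)  # 约等于 [0, 0.52, 1.05, 1.57, 2.09]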
csv 文件下载:data.csv,下载后复制成四份,分别命名为 `data1.csv`、`data2.csv`、`data3.csv`、`data4.csv`。
import numpy as np
import pandas as pd
from scipy.interpolate import interp2d  # 后面需要的插值库
from matplotlib import pyplot as plt
从 `csv` 文件中读取数据:

data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
data3 = pd.read_csv('data3.csv')
data4 = pd.read_csv('data4.csv')
data = [data1, data2, data3, data4]
# 注意:下面三行针对的是单个 DataFrame(后面的绘图函数每次从 data 列表中取一个元素),这里以 data1 为例
pos = np.array(data1['pos'] / 180 * np.pi)        # 'pos' 列:角度换算成弧度
ind = np.array(data1.columns[1:], dtype=np.int)   # 列名:半径
values = np.array(data1[ind.astype('str')])       # 对应的数值矩阵
pos = np.radians(np.linspace(0, 360, 30))
ind = np.arange(0, 90, 10)
values = np.random.random((pos.size, ind.size))
import numpy as np
import pandas as pd
from scipy.interpolate import interp2d
from matplotlib import pyplot as plt

data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
data3 = pd.read_csv('data3.csv')
data4 = pd.read_csv('data4.csv')
data = [data1, data2, data3, data4]

def plot_weather_heatmap(dataList, title):
    plt.figure(figsize=(25, 25))
    for i in range(len(dataList)):
        data = dataList[i]
        '''
        方法一:从csv文件中读取数据
        '''
        # pos = np.array(data['pos']/180*np.pi)
        # ind = np.array(data.columns[1:], dtype=np.int)
        # values = np.array(data[ind.astype('str')])
        '''
        方法二:随机产生数据
        '''
        pos = np.radians(np.linspace(0, 360, 30))
        ind = np.arange(0, 90, 10)
        values = np.random.random((pos.size, ind.size))
        # 计算插值函数
        func = interp2d(pos, ind, values.T, kind='cubic')
        # 绘图数据点
        tnew = np.linspace(0, 2*np.pi, 200)  # theta
        rnew = np.linspace(0, 90, 100)       # r
        vnew = func(tnew, rnew)
        tnew, rnew = np.meshgrid(tnew, rnew)
        ax = plt.subplot(2, 2, i+1, projection='polar')
        plt.pcolor(tnew, rnew, vnew, cmap='jet')
        plt.grid(c='black')
        plt.colorbar()
        ax.set_theta_zero_location("N")
        ax.set_theta_direction(-1)
        plt.title(title[i], fontsize=20)
        # 设置坐标标签标注和字体大小
        plt.xlabel(' ', fontsize=15)
        plt.ylabel(' ', fontsize=15)
        # 设置坐标刻度字体大小
        plt.xticks(fontsize=15, rotation=90)
        plt.yticks(fontsize=15)
        # cb.set_label("Pixel reflectance")

title = ['Spring', 'Summer', 'Autumn', 'Winter']
plot_weather_heatmap(data, title)
plt.savefig("pic.png", dpi=300)
plt.show()
关于下面这句中的 `cmap` 参数:`jet` 指定的是图的色域(colormap),可以更换成别的值,让图更好看。
plt.pcolor(tnew, rnew, vnew, cmap='jet')
可选值如下
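如果想直接查看本机 matplotlib 支持的所有 colormap 名称,可以用下面两行代码(仅作演示):

from matplotlib import pyplot as plt

print(plt.colormaps())  # 列出所有可用的 colormap 名称,任选一个传给 cmap 即可,例如 cmap='viridis'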
毕业季到了,要开始写论文了,朋友圈各种同学的各种课题的调查问卷,但几乎没什么人填,想帮他们随机填一填。
使用前需要注意或修改的几处:问卷链接用的是 `jq` 开头的地址而不是 `m` 开头的;要填写的份数在 `times` 里面修改数值;问卷地址需要在 `FillTheQuestionaire`
函数里面修改;from selenium import webdriverimport timeimport reimport osfrom bs4 import BeautifulSoupfrom lxml import etreeimport randomimport pandas as pddef ChangeIP(): page = random.randint(1,4055) url = 'https://www.xicidaili.com/nn/' # url = 'https://www.kuaidaili.com/free/' driverIP = webdriver.Chrome() driverIP.get(url) content = driverIP.page_source.encode('utf-8') html = etree.HTML(content) t = html.xpath("//div[@class='bar']/div[@class='bar_inner fast']") flow = ['99%','98%'] for i in t: if i.attrib['style'].split(":")[1] in flow: index = t.index(i) break driverIP.quit() data = pd.read_html(content) ipinfo = data[0].values[index] # ipinfo = random.choice(data[0].values) ip = str(ipinfo[5]).lower() + "://" + str(ipinfo[1]) + ":" + str(ipinfo[2]) # ip = str(ipinfo[3]).lower() + "://" + str(ipinfo[0]) + ":" + str(ipinfo[1]) return ipdef FillTheQuestionaire(times): url = 'https://www.wjx.cn/jq/74385885.aspx' for t in range(times): mobileEmulation = {'deviceName': 'iPhone X'} options = webdriver.ChromeOptions() options.add_experimental_option('mobileEmulation', mobileEmulation) # options = webdriver.ChromeOptions() # ip = ChangeIP() # print(ip) # options.add_argument("--proxy-server=" + ip) # driver = webdriver.Chrome(chrome_options=options) # if t % 2 == 0: driver = webdriver.Chrome() # else: # driver = webdriver.Chrome(chrome_options=options) driver.get(url) content = driver.page_source.encode('utf-8') html = etree.HTML(content) soup = BeautifulSoup(content, 'lxml') NumOfQuestions = len(driver.find_elements_by_xpath( "//div[@class='div_question']")) for quiz in range(NumOfQuestions): try: question = driver.find_elements_by_xpath("//div[@id='divquestion" + str( quiz + 1) + "']//ul[@class='ulradiocheck']//li//a[@class='jqRadio']") random.choice(question).click() except: pass try: tr = driver.find_elements_by_xpath( "//div[@id='divquestion" + str(quiz + 1) + "']/table/tbody/tr") for t in range(len(tr)): button = driver.find_elements_by_xpath("//div[@id='divquestion" + str( quiz + 1) + "']/table/tbody/tr[" + str(t + 1) + "]/td/a[@class='jqRadio']") try: random.choice(button).click() except: pass except: pass try: checkbox = driver.find_elements_by_xpath("//div[@id='divquestion" + str( quiz + 1) + "']//ul[@class='ulradiocheck']//li//a[@class='jqCheckbox']") YorN = [x for x in range(2)] checkbox[0].click() for i in range(len(checkbox) - 2): if random.choice(YorN) == 1: print("是") try: checkbox[i+1].click() except: pass except: pass time.sleep(3) driver.find_elements_by_xpath("//input[@id='submit_button']")[0].click() time.sleep(2) driver.quit()if __name__ == "__main__": times = 10 FillTheQuestionaire(times)
]]>
需要实现以下函数:

- `load_data()`:读取数据,并转换为可用的形式;
- `split_data()`:将数据集分为训练集和测试集;
- `train()`:从当前数据集中训练模型;
- `predict()`:用 `train()` 生成的模型,对测试集的学生进行分班;
- `evaluate()`:输出模型的准确率。

其中 `train()` 和 `predict()` 不可以用第三方库。数据集的每个实例(instances)包含以下 30 个属性:

1 school - students school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2 sex - students sex (binary: "F" - female or "M" - male)
3 address - students home address type (binary: "U" - urban or "R" - rural)
4 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
5 Pstatus - parents cohabitation status (binary: "T" - living together or "A" - apart)
6 Medu - mothers education (nominal: low, none, mid, high)
7 Fedu - fathers education (nominal: low, none, mid, high)
8 Mjob - mothers job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
9 Fjob - fathers job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
10 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
11 guardian - students guardian (nominal: "mother", "father" or "other")
12 traveltime - home to school travel time (nominal: none, low, medium, high, very_high)
13 studytime - weekly study time (nominal: none, low, medium, high, very_high)
14 failures - number of past class failures (nominal: none, low, medium, high, very_high)
15 schoolsup - extra educational support (binary: yes or no)
16 famsup - family educational support (binary: yes or no)
17 paid - extra paid classes within the course subject (binary: yes or no)
18 activities - extra-curricular activities (binary: yes or no)
19 nursery - attended nursery school (binary: yes or no)
20 higher - wants to take higher education (binary: yes or no)
21 internet - Internet access at home (binary: yes or no)
22 romantic - with a romantic relationship (binary: yes or no)
23 famrel - quality of family relationships (nominal: very_bad, bad, mediocre, good, excellent)
24 freetime - free time after school (nominal: very_low, low, mediocre, high, very_high)
25 goout - going out with friends (nominal: very_low, low, mediocre, high, very_high)
26 Dalc - workday alcohol consumption (nominal: very_low, low, mediocre, high, very_high)
27 Walc - weekend alcohol consumption (nominal: very_low, low, mediocre, high, very_high)
28 health - current health status (nominal: very_bad, bad, mediocre, good, excellent)
29 absences - number of school absences (nominal: none, one_to_three, four_to_six, seven_to_ten, more_than_ten)
30 Grade - final grade (A+, A, B, C, D, F)
load_data()
# This function should open a data file in csv, and transform it into a usable format
def load_data():
    import pandas as pd
    data = pd.read_csv('student.csv', sep=',')
    return data
split_data()
# This function should split a data set into a training set and hold-out test set
def split_data(data, test_size):
    """
    split the data into train set and test set
    :param data: Dtype from pd.read_csv
    :param test_size: float, define the position to split
    :return:
    """
    import numpy as np
    X = data[list(data.columns[:-1])].values   # get the instances matrix
    y = data['Grade']                          # get the class vector
    index = np.arange(data.shape[0])           # get the number of the dataset
    np.random.shuffle(index)                   # shuffle the order of the data
    X = X[index]                               # reorder the instances matrix
    y = y[index]                               # reorder the class vector
    split_point = int(X.shape[0] * test_size)  # define the position to split the data into train and test
    X_train, X_test = X[:split_point], X[split_point:]
    y_train, y_test = y[:split_point], y[split_point:]
    return X_train, X_test, y_train, y_test
train()
# This function should build a supervised NB modeldef train(X, y, alpha): """ train or generate the probability matrix of Naive Bayes Classifier :param X: Dtype from pd.read_csv, train set :param y: Dtype from pd.read_csv, train class :param alpha: Laplace smooth index :return: """ y_class_count = {} feature_dimension = len(X[1]) # number of feature # get the number of each labels for c in y: y_class_count[c] = y_class_count.get(c, 0) + 1 # generate the dict of class, e.g. {'A':'69',...} y_class_tuple = sorted(y_class_count.items(), reverse=False) # generate the tuple of class and sort it in terms of number, e.g. [('A','69'),...] K = len(y_class_tuple) # the specific number of class grade N = len(y) # the number of instances # get the prior probability prior_prob = {} for key in range(len(y_class_tuple)): prior_prob[y_class_tuple[key][0]] = (y_class_tuple[key][1] + alpha) / (N + K * alpha) # laplace smooth # get the value set of each feature feature_value = [] # feature with different value feature_value_number = [] # the number of unique values of each feature for feature in range(feature_dimension): unique_feature = list(set(X[:, feature])) # use `set` to get the unique value feature_value_number.append(len(unique_feature)) feature_value.append(unique_feature) # calculate the conditional probability conditional_prob = [] # calculate the count (x = a & y = c) for j in range(feature_dimension): count = [[0 for i in range(len(y_class_count))] for i in range(feature_value_number[j])] # use list comprehension to generate zero matrix, (feature_value_number[j] rows x y_class_count cols) for i in range(len(X[:, j])): for k in range(len(feature_value[j])): for t in range(len(y_class_count)): if X[:, j][i] == feature_value[j][k] and list(y)[i] == y_class_tuple[t][0]: # x = value and y = class, get the count count[k][t] += 1 # calculate the conditional probability for m in range(len(y_class_tuple)): for r in range(len(count)): count[r][m] = (count[r][m] + alpha) / (y_class_tuple[m][1] + alpha * feature_value_number[j]) # laplace smoothing conditional_prob.append(count) return y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob
predict()
def classify(y_class_tuple, prior_prob, feature_value, conditional_prob, feature_value_number, alpha, instance): """ generate the answer of classification :param y_class_tuple: list, the tuple of class and sort it in terms of number :param prior_prob: float list, prior probability of class :param feature_value: list, feature value of all the attributes :param conditional_prob: float list, posterior probability :param feature_value_number: float list, number of different unique features :param alpha: float, Laplace smooth index default 1 :param instance: list, one row of test set :return: """ import math predict = {} for m in range(len(y_class_tuple)): # get the prior_probability of m-th label in y_class_tuple yhat = math.log(prior_prob[y_class_tuple[m][0]]) # use log-transformation to avoid float missing for n in range(len(instance)): if instance[n] in feature_value[n]: index = feature_value[n].index(instance[n]) # locate the feature in feature_value yhat = yhat + math.log(conditional_prob[n][index][m]) # accumulate the probability else: # if the value of feature is not in training set, return the laplace smoothing yhat = alpha / (feature_value_number[n] * alpha) predict[y_class_tuple[m][0]] = yhat return predict
# This function should predict the class for an instance or a set of instances, based on a trained model def predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X, alpha, flag=0): """ predict the class for an instance or a set of instances, based on a trained model :param y_class_tuple: list, the tuple of class and sort it in terms of number :param prior_prob: float list, prior probability of class :param feature_value: list, feature value of all the attributes :param conditional_prob: float list, posterior probability :param feature_value_number: float list, number of different unique features :param alpha: float, Laplace smooth index default 1 :param X: Dtype from pd.read_csv, test set :param flag: set 1 return probability or set 0 return prediction, default 0 :return: """ import operator as op test_num = len(X) prediction = [0 for i in range(test_num)] probability = [0 for i in range(test_num)] for i in range(test_num): result = classify(y_class_tuple, prior_prob, feature_value, conditional_prob, feature_value_number, 1, X[i, :]) # result is the probability of each class result = sorted(result.items(), key=op.itemgetter(1), reverse=True) # the max probability is the predict class prediction[i] = result[0][0] # show the predict answer probability[i] = result[0][1] # show the predict probability if flag: return probability else: return prediction
evaluate()
# This function should evaluate a set of predictions in terms of accuracy
def evaluate(p, y_test):
    accuracy = sum(p == y_test) / len(y_test)
    return accuracy
data = load_data()
X_train, X_test, y_train, y_test = split_data(data, 0.7)
y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob = train(X_train, y_train, 1)
p = predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X_test, 1)
evaluate(p, y_test)
# This function should open a data file in csv, and transform it into a usable format def load_data(): import pandas as pd data = pd.read_csv('student.csv', sep=',') return data# This function should split a data set into a training set and hold-out test setdef split_data(data, test_size): """ split the data into train set and test set :param data: Dtype from pd.read_csv :param test_size: float, define the position to split :return: """ import numpy as np X = data[list(data.columns[:-1])].values # get the instances matrix y = data['Grade'] # get the class vector index = np.arange(data.shape[0]) # get the number of the dataset np.random.shuffle(index) # shuffle the order of the data X = X[index] # reorder the instances matrix y = y[index] # reorder the class vector split_point = int(X.shape[0] * test_size) # define the position to split the data into train and test X_train, X_test = X[:split_point], X[split_point:] y_train, y_test = y[:split_point], y[split_point:] return X_train, X_test, y_train, y_test# This function should build a supervised NB modeldef train(X, y, alpha): """ train or generate the probability matrix of Naive Bayes Classifier :param X: Dtype from pd.read_csv, train set :param y: Dtype from pd.read_csv, train class :param alpha: Laplace smooth index :return: """ y_class_count = {} feature_dimension = len(X[1]) # number of feature # get the number of each labels for c in y: y_class_count[c] = y_class_count.get(c, 0) + 1 # generate the dict of class, e.g. {'A':'69',...} y_class_tuple = sorted(y_class_count.items(), reverse=False) # generate the tuple of class and sort it in terms of number, e.g. [('A','69'),...] K = len(y_class_tuple) # the specific number of class grade N = len(y) # the number of instances # get the prior probability prior_prob = {} for key in range(len(y_class_tuple)): prior_prob[y_class_tuple[key][0]] = (y_class_tuple[key][1] + alpha) / (N + K * alpha) # laplace smooth # get the value set of each feature feature_value = [] # feature with different value feature_value_number = [] # the number of unique values of each feature for feature in range(feature_dimension): unique_feature = list(set(X[:, feature])) # use `set` to get the unique value feature_value_number.append(len(unique_feature)) feature_value.append(unique_feature) # calculate the conditional probability conditional_prob = [] # calculate the count (x = a & y = c) for j in range(feature_dimension): count = [[0 for i in range(len(y_class_count))] for i in range(feature_value_number[j])] # use list comprehension to generate zero matrix, (feature_value_number[j] rows x y_class_count cols) for i in range(len(X[:, j])): for k in range(len(feature_value[j])): for t in range(len(y_class_count)): if X[:, j][i] == feature_value[j][k] and list(y)[i] == y_class_tuple[t][0]: # x = value and y = class, get the count count[k][t] += 1 # calculate the conditional probability for m in range(len(y_class_tuple)): for r in range(len(count)): count[r][m] = (count[r][m] + alpha) / (y_class_tuple[m][1] + alpha * feature_value_number[j]) # laplace smoothing conditional_prob.append(count) return y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_probdef classify(y_class_tuple, prior_prob, feature_value, conditional_prob, feature_value_number, alpha, instance): """ generate the answer of classification :param y_class_tuple: list, the tuple of class and sort it in terms of number :param prior_prob: float list, prior probability of class :param feature_value: list, feature value of all the 
attributes :param conditional_prob: float list, posterior probability :param feature_value_number: float list, number of different unique features :param alpha: float, Laplace smooth index default 1 :param instance: list, one row of test set :return: """ import math predict = {} for m in range(len(y_class_tuple)): # get the prior_probability of m-th label in y_class_tuple yhat = math.log(prior_prob[y_class_tuple[m][0]]) # use log-transformation to avoid float missing for n in range(len(instance)): if instance[n] in feature_value[n]: index = feature_value[n].index(instance[n]) # locate the feature in feature_value yhat = yhat + math.log(conditional_prob[n][index][m]) # accumulate the probability else: # if the value of feature is not in training set, return the laplace smoothing yhat = alpha / (feature_value_number[n] * alpha) predict[y_class_tuple[m][0]] = yhat return predict# This function should predict the class for an instance or a set of instances, based on a trained model def predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X, alpha, flag=0): """ predict the class for an instance or a set of instances, based on a trained model :param y_class_tuple: list, the tuple of class and sort it in terms of number :param prior_prob: float list, prior probability of class :param feature_value: list, feature value of all the attributes :param conditional_prob: float list, posterior probability :param feature_value_number: float list, number of different unique features :param alpha: float, Laplace smooth index default 1 :param X: Dtype from pd.read_csv, test set :param flag: set 1 return probability or set 0 return prediction, default 0 :return: """ import operator as op test_num = len(X) prediction = [0 for i in range(test_num)] probability = [0 for i in range(test_num)] for i in range(test_num): result = classify(y_class_tuple, prior_prob, feature_value, conditional_prob, feature_value_number, 1, X[i, :]) # result is the probability of each class result = sorted(result.items(), key=op.itemgetter(1), reverse=True) # the max probability is the predict class prediction[i] = result[0][0] # show the predict answer probability[i] = result[0][1] # show the predict probability if flag: return probability else: return prediction# This function should evaluate a set of predictions in terms of accuracydef evaluate(p, y_test): accuracy = sum(p == y_test)/len(y_test) return accuracydata = load_data()X_train, X_test, y_train, y_test = split_data(data, 0.7)y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob = train(X_train, y_train, 1)p = predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X_test, 1)evaluate(p, y_test)
0.4358974358974359
]]>
节点 1 将一段用 `python` 语言编写的源程序 A 发送给节点 2,节点 2 执行程序 A,并向节点 1 返回结果;程序 A 中调用了 `Barrier` 函数。当程序 A 运行到 `Barrier` 时,程序 A 阻塞在该函数中;函数 `Barrier` 发起与节点 1 的通信,等待节点 1 发送 `GOON` 以后,函数 `Barrier`
返回,程序 A 继续执行直至结束。import socketimport timeimport threadingimport osimport sysdef ReceiveFile(conn): while True: # 连接成功后一直使用当前连接,直到退出 with open("recv.py", "ab") as f: data = conn.recv(1024) if data == b'quit': break if data != b'success': f.write(data) conn.send("success".encode()) print("文件barrier.py已经接收!存储为recv.py") f.close()def SendAnswer(conn): while True: if os.path.exists("recv.py"): ans = os.popen("python recv.py") ansRead = ans.read() print("recv.py运行完毕,得到结果为%s" % (str(ansRead))) with open('output.txt', "w") as f: f.write(ansRead) f.close() print("将得到的结果写入output.txt") with open('output.txt', 'rb') as f: for i in f: conn.send(i) data = conn.recv(1024) if data != b'success': break conn.send('quit'.encode()) print("将output.txt发送完毕!") break conn.close()def SendPyFile(conn): with open('barrier.py', 'rb') as f: for i in f: conn.send(i) data = conn.recv(1024) if data != b'success': break print("文件barrier.py已经发送!") conn.send('quit'.encode())def ReceiveAnswer(conn): while True: with open("recv_output.txt", "ab") as f: data = conn.recv(1024) if data == b'quit': break if data != b'success': f.write(data) conn.send("success".encode()) print("结果接收完毕,存储在recv_output.txt!") f.close() conn.send('quit'.encode())def process(conn): print("等待5秒返回GOON!") for i in range(5): print(i+1) time.sleep(1) conn.send('GOON'.encode())def ClientBarrier(conn): while True: conn, addr = conn.accept() print("barrier函数阻塞,连接建立,地址为%s"%(str(addr))) t = threading.Thread(target=process, args=(conn,)) t.start() break# 改写线程类class msgThread(threading.Thread): def __init__(self, conn, flag): threading.Thread.__init__(self) self.conn = conn self.flag = flag def run(self): if self.flag == "send_file": SendPyFile(self.conn) elif self.flag == "receive_answer": ReceiveAnswer(self.conn) elif self.flag == "receive_file": ReceiveFile(self.conn) else: SendAnswer(self.conn)def Client(address): client = socket.socket(socket.AF_INET, socket.SOCK_STREAM) while True: try: client.connect((address, 6999)) # 建立一个链接,连接到本地的6999端口 break except: print("等待侦听!") time.sleep(1) Thread_receive = msgThread(client, "receive_file") Thread_send = msgThread(client, "send_answer") Thread_receive.start() Thread_send.start()def Server(): server = socket.socket(socket.AF_INET, socket.SOCK_STREAM) server.bind(('127.0.0.1', 6999)) # 绑定要监听的端口 server.listen(5) # 开始监听 表示可以使用五个链接排队 conn, addr = server.accept() # 等待链接,多个链接的时候就会出现问题,其实返回了两个值 print("侦听器已启动!port:6999") print("连接建立,地址在%s"%(str(addr))) Thread_receive = msgThread(conn, "send_file") Thread_send = msgThread(conn, "receive_answer") Thread_barrier = threading.Thread(target=ClientBarrier, args=(server,)) Thread_receive.start() Thread_send.start() Thread_barrier.start()if __name__ == "__main__": BootMode = input("请选择启动方式(1(控制节点)或2(计算节点)):\n") if BootMode == '1': Server() else: port = input("请输入侦听服务器地址(默认127.0.0.1):\n") Client(port)
import random
import socket

def barrier():
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(('127.0.0.1', 6999))
    while True:
        data = client.recv(1024)
        if data == b'GOON':
            break
    client.send('quit'.encode())
    client.close()

def MaxMin():
    a = []
    for i in range(10):
        a.append(random.random() * 10)
    barrier()
    print(max(a), min(a))

if __name__ == "__main__":
    MaxMin()
]]>
python
语言编写的源程序 A,节点 2 执行程序 A,并向节import socketimport timeimport threadingimport osimport sysdef ReceiveFile(conn): while True: # 连接成功后一直使用当前连接,直到退出 with open("recv.py", "ab") as f: data = conn.recv(1024) if data == b'quit': break if data != b'success': f.write(data) conn.send("success".encode()) print("文件test.py已经接收!存储为recv.py") f.close()def SendAnswer(conn): while True: if os.path.exists("recv.py"): ans = os.popen("python recv.py") ansRead = ans.read() print("recv.py运行完毕,得到结果为%s" % (str(ansRead))) with open('output.txt', "w") as f: f.write(ansRead) f.close() print("将得到的结果写入output.txt") with open('output.txt', 'rb') as f: for i in f: conn.send(i) data = conn.recv(1024) if data != b'success': break conn.send('quit'.encode()) print("将output.txt发送完毕!") break conn.close()def SendPyFile(conn): with open('test.py', 'rb') as f: for i in f: conn.send(i) data = conn.recv(1024) if data != b'success': break print("文件test.py已经发送!") conn.send('quit'.encode())def ReceiveAnswer(conn): while True: with open("recv_output.txt", "ab") as f: data = conn.recv(1024) if data == b'quit': break if data != b'success': f.write(data) conn.send("success".encode()) print("结果接收完毕,存储在recv_output.txt!") f.close() conn.send('quit'.encode())class msgThread(threading.Thread): def __init__(self, conn, flag): threading.Thread.__init__(self) self.conn = conn self.flag = flag def run(self): if self.flag == "send_file": SendPyFile(self.conn) elif self.flag == "receive_answer": ReceiveAnswer(self.conn) elif self.flag == "receive_file": ReceiveFile(self.conn) else: SendAnswer(self.conn)# 改写线程类def Client(address): client = socket.socket(socket.AF_INET, socket.SOCK_STREAM) while True: try: client.connect((address, 6999)) # 建立一个链接,连接到本地的6999端口 break except: print("等待侦听!") time.sleep(1) Thread_receive = msgThread(client, "receive_file") Thread_send = msgThread(client, "send_answer") Thread_receive.start() Thread_send.start()def Server(): server = socket.socket(socket.AF_INET, socket.SOCK_STREAM) server.bind(('127.0.0.1', 6999)) # 绑定要监听的端口 server.listen(5) # 开始监听 表示可以使用五个链接排队 conn, addr = server.accept() # 等待链接,多个链接的时候就会出现问题,其实返回了两个值 print("侦听器已启动!port:6999") print("连接建立,地址在%s" % (str(addr))) Thread_receive = msgThread(conn, "send_file") Thread_send = msgThread(conn, "receive_answer") Thread_receive.start() Thread_send.start()if __name__ == "__main__": BootMode = input("请选择启动方式(1(控制节点)或2(计算节点)):\n") if BootMode == '1': Server() else: port = input("请输入侦听服务器地址(默认127.0.0.1):\n") Client(port)
import random

a = []
for i in range(10):
    a.append(random.random() * 10)
print(max(a), min(a))
]]>
import socket # 客户端 发送一个数据,再接收一个数据import timeimport threadingimport ctypesimport inspectquit = 0# 终止线程def _async_raise(tid, exctype): """raises the exception, performs cleanup if needed""" tid = ctypes.c_long(tid) if not inspect.isclass(exctype): exctype = type(exctype) res = ctypes.pythonapi.PyThreadState_SetAsyncExc( tid, ctypes.py_object(exctype)) if res == 0: raise ValueError("invalid thread id") elif res != 1: # """if it returns a number greater than one, you're in trouble, # and you should call it again with exc=NULL to revert the effect""" ctypes.pythonapi.PyThreadState_SetAsyncExc(tid, None) raise SystemError("PyThreadState_SetAsyncExc failed")def stop_thread(thread): _async_raise(thread.ident, SystemExit)def ReceiveMsg(conn): global quit while True: try: data = conn.recv(1024) except: print("连接结束") conn.close() break if str(data.decode()).upper() != 'QUIT': print('recive:', data.decode()) else: quit = 1 conn.close() breakdef SendMsg(conn): global quit while True: send = input("send:\n") try: conn.send(send.encode('utf-8')) except: print("连接结束") conn.close() break if str(send).upper() == 'QUIT': conn.close() break# 改写线程class msgThread(threading.Thread): def __init__(self, conn, flag): threading.Thread.__init__(self) self.conn = conn self.flag = flag def run(self): if self.flag == 1: ReceiveMsg(self.conn) else: SendMsg(self.conn)# 声明socket类型,同时生成链接对象def Client(address): client = socket.socket(socket.AF_INET, socket.SOCK_STREAM) while True: try: client.connect((address, 6999)) # 建立一个链接,连接到本地的6999端口 break except: print("等待侦听!") time.sleep(1) Thread_receive = msgThread(client, 1) Thread_send = msgThread(client, 2) Thread_receive.start() Thread_send.start()def Server(): server = socket.socket(socket.AF_INET, socket.SOCK_STREAM) server.bind(('127.0.0.1', 6999)) # 绑定要监听的端口 server.listen(5) # 开始监听 表示可以使用五个链接排队 conn, addr = server.accept() # 等待链接,多个链接的时候就会出现问题,其实返回了两个值 print("侦听器已启动!port:6999") print(conn, addr) Thread_receive = msgThread(conn, 1) Thread_send = msgThread(conn, 2) Thread_receive.start() Thread_send.start()if __name__ == "__main__": BootMode = input("请选择启动方式(1或2):\n") if BootMode == '1': Server() else: port = input("请输入侦听服务器地址(默认127.0.0.1):\n") Client(port)
]]>
特征工程,是指用一系列工程化的方式从原始数据中筛选出更好的数据特征,以提升模型的训练效果。业内有一句广为流传的话是:数据和特征决定了机器学习的上限,而模型和算法是在逼近这个上限而已。由此可见,好的数据和特征是模型和算法发挥更大的作用的前提。特征工程通常包括数据预处理、特征选择、降维等环节。如下图所示:
我们经常在处理数据时,会面临这样的问题:数据可能来自不同的源(`SQL` 数据库、`JSON`、`CSV` 等),原始特征往往非常多。而减少统计分析期间要使用的特征的数量可能会带来一些好处,例如降低过拟合的风险、减少训练和分析的开销等。
事实上,统计上证明,当执行机器学习任务时,存在针对每个特定任务应该使用的最佳数量的特征(图 1)。如果添加的特征比必要的特征多,那么我们的模型性能将下降(因为添加了噪声)。真正的挑战是找出哪些特征是最佳的使用特征(这实际上取决于我们提供的数据量和我们正在努力实现的任务的复杂性)。这就是特征选择技术能够帮到我们的地方!
这是一道来自于天池的新手练习题目,用数据分析、机器学习等手段进行 二手车售卖价格预测 的回归问题。赛题本身的思路清晰明了,即对给定的数据集进行分析探讨,然后设计模型运用数据进行训练,测试模型,最终给出选手的预测结果。前面我们已经进行过 EDA 分析,在这里:天池_二手车价格预测_Task1-2_赛题理解与数据分析。
赛题官方给出了来自 Ebay Kleinanzeigen 的二手车交易记录,总数据量超过 40w,包含 31 列变量信息,其中 15 列为匿名变量,即 `v_0` 至 `v_14`。并从中抽取 15 万条作为训练集,5 万条作为测试集 A,5 万条作为测试集 B,同时对 `name`、`model`、`brand` 和 `regionCode` 等信息进行脱敏。具体的数据表如下:
Field | Description |
---|---|
SaleID | 交易ID,唯一编码 |
name | 汽车交易名称,已脱敏 |
regDate | 汽车注册日期,例如20160101,2016年01月01日 |
model | 车型编码,已脱敏 |
brand | 汽车品牌,已脱敏 |
bodyType | 车身类型:豪华轿车:0,微型车:1,厢型车:2,大巴车:3,敞篷车:4,双门汽车:5,商务车:6,搅拌车:7 |
fuelType | 燃油类型:汽油:0,柴油:1,液化石油气:2,天然气:3,混合动力:4,其他:5,电动:6 |
gearbox | 变速箱:手动:0,自动:1 |
power | 发动机功率:范围 [ 0, 600 ] |
kilometer | 汽车已行驶公里,单位万km |
notRepairedDamage | 汽车有尚未修复的损坏:是:0,否:1 |
regionCode | 地区编码,已脱敏 |
seller | 销售方:个体:0,非个体:1 |
offerType | 报价类型:提供:0,请求:1 |
creatDate | 汽车上线时间,即开始售卖时间 |
price | 二手车交易价格(预测目标) |
v系列特征 | 匿名特征,包含v0-14在内15个匿名特征 |
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from operator import itemgetter
%matplotlib inline
train = pd.read_csv('used_car_train_20200313.csv', sep=' ')
test = pd.read_csv('used_car_testA_20200313.csv', sep=' ')
print(train.shape)
print(test.shape)
(150000, 31)
(50000, 30)
train.head()
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
5 rows × 31 columns
train.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'], dtype='object')
这里可以将箱型图中超过上下限的那些值作为异常值删除。如下图所示,箱型图中间是一个箱体,也就是粉红色部分,箱体左边、中间、右边分别有一条线:左边是下四分位数($Q1$),右边是上四分位数($Q3$),中间是中位数($Median$),上下四分位数之差是四分位距 $IQR$(Interquartile Range)。用 $Q1-1.5IQR$ 得到下边缘(最小值),$Q3+1.5IQR$ 得到上边缘(最大值)。在上边缘之外的数据就是极大异常值,在下边缘之外的数据就是极小异常值。
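先用一组随手造的数字验证一下上下边缘的算法(与赛题数据无关,仅作说明):

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])   # 100 是故意放进去的异常值
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
val_low, val_up = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
print(Q1, Q3, IQR, val_low, val_up)      # 3.25 7.75 4.5 -3.5 14.5
print(s[(s < val_low) | (s > val_up)])   # 只有 100 会被标记为异常值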
搞清楚原理那我们就构造一个实现上述功能的函数吧!
def drop_outliers(data, col_name, scale = 1.5):
    """
    用于清洗异常值,默认用 box_plot(scale=1.5)进行清洗
    :param data: 接收 pandas 数据格式
    :param col_name: pandas 列名
    :param scale: 尺度
    :return:
    """
    data_n = data.copy()
    data_series = data_n[col_name]
    IQR = scale * (data_series.quantile(0.75) - data_series.quantile(0.25))  # quantile是pd内置的求四分位的函数
    val_low = data_series.quantile(0.25) - IQR   # 下边缘
    val_up = data_series.quantile(0.75) + IQR    # 上边缘
    rule_low = (data_series < val_low)   # 下边缘的极小异常值的布尔下标
    rule_up = (data_series > val_up)     # 上边缘的极大异常值的布尔下标
    index = np.arange(data_series.shape[0])[rule_low | rule_up]  # | 运算就是说只要rule_low和rule_up中有一个值为True,就把这个下标取出来
    print(index)
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)  # 删除index对应下标的元素
    data_n.reset_index(drop=True, inplace=True)  # 下文有介绍
    print("Now column number is: {}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule_low]  # 下边缘的异常数据的描述统计量
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule_up]  # 上边缘的异常数据的描述统计量
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n
这里 `reset_index` 可以还原索引,重新变为默认的整型索引:
DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
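先看一个最小的使用示例(数据是随手造的,仅演示 `drop=True` 前后的区别):

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]}, index=[7, 3, 5])
print(df.reset_index())            # 原索引变成名为 index 的普通列
print(df.reset_index(drop=True))   # 原索引被直接丢弃,换成 0、1、2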
- `level`:`int`、`str`、`tuple` 或 `list`,默认无,仅从索引中删除给定级别,默认情况下移除所有级别,控制了具体要还原的那个等级的索引;
- `drop`:`drop` 为 `False` 则索引列会被还原为普通列,否则会丢失;
- `inplace`:默认为 `False`,为 `True` 时就地修改 `DataFrame`(不创建新对象);
- `col_level`:`int` 或 `str`,默认值为 0,如果列有多个级别,则确定将标签插入到哪个级别,默认插入到第一级;
- `col_fill`:对象,默认 '',如果列有多个级别,则确定其他级别的命名方式,如果没有,则重复索引名。

下面对 `power` 列调用上面定义的函数:

drop_outliers(train, 'power', scale=1.5)
[ 33 77 104 ... 149967 149981 149984]Delete number is: 4878Now column number is: 145122Description of data less than the lower bound is:count 0.0mean NaNstd NaNmin NaN25% NaN50% NaN75% NaNmax NaNName: power, dtype: float64Description of data larger than the upper bound is:count 4878.000000mean 410.132021std 884.219933min 264.00000025% 286.00000050% 306.00000075% 349.000000max 19312.000000Name: power, dtype: float64
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145117 | 149995 | 163978 | 20000607 | 121.0 | 10 | 4.0 | 0.0 | 1.0 | 163 | 15.0 | ... | 0.280264 | 0.000310 | 0.048441 | 0.071158 | 0.019174 | 1.988114 | -2.983973 | 0.589167 | -1.304370 | -0.302592 |
145118 | 149996 | 184535 | 20091102 | 116.0 | 11 | 0.0 | 0.0 | 0.0 | 125 | 10.0 | ... | 0.253217 | 0.000777 | 0.084079 | 0.099681 | 0.079371 | 1.839166 | -2.774615 | 2.553994 | 0.924196 | -0.272160 |
145119 | 149997 | 147587 | 20101003 | 60.0 | 11 | 1.0 | 1.0 | 0.0 | 90 | 6.0 | ... | 0.233353 | 0.000705 | 0.118872 | 0.100118 | 0.097914 | 2.439812 | -1.630677 | 2.290197 | 1.891922 | 0.414931 |
145120 | 149998 | 45907 | 20060312 | 34.0 | 10 | 3.0 | 1.0 | 0.0 | 156 | 15.0 | ... | 0.256369 | 0.000252 | 0.081479 | 0.083558 | 0.081498 | 2.075380 | -2.633719 | 1.414937 | 0.431981 | -1.659014 |
145121 | 149999 | 177672 | 19990204 | 19.0 | 28 | 6.0 | 0.0 | 1.0 | 193 | 12.5 | ... | 0.284475 | 0.000000 | 0.040072 | 0.062543 | 0.025819 | 1.978453 | -3.179913 | 0.031724 | -1.483350 | -0.342674 |
145122 rows × 31 columns
从这张删除异常值前后的箱型图对比可以看出,剔除异常值后,数据的分布就很均匀了。
下面我们就批量对所有的特征进行一次异常数据删除:
def Bach_drop_outliers(data,scale=1.5): dataNew = data.copy() for fea in data.columns: try: IQR = scale * (dataNew[fea].quantile(0.75) - dataNew[fea].quantile(0.25)) # quantile是pd内置的求四分位的函数 except: continue val_low = dataNew[fea].quantile(0.25) - IQR # 下边缘 val_up = dataNew[fea].quantile(0.75) + IQR # 上边缘 rule_low = (dataNew[fea] < val_low) # 下边缘的极小异常值的下标列表 rule_up = (dataNew[fea] > val_up) # 上边缘的极大异常值的下标列表 index = np.arange(dataNew[fea].shape[0])[rule_low | rule_up] # | 运算就是说只要rule_low和rule_up中只要有一个值为True,就把这个下标取出来 print("feature %s deleted number is %d"%(fea, len(index))) dataNew = dataNew.drop(index)# 删除index对应下标的元素 dataNew.reset_index(drop=True, inplace=True) fig, ax = plt.subplots(5, 6, figsize = (20, 15)) x = 0 y = 0 for fea in dataNew.columns: try: sns.boxplot(y = dataNew[fea], data =dataNew, palette = "Set2", ax = ax[x][y]) y+=1 if y == 6: y = 0 x += 1 except: print(fea) y+=1 if y == 6: y = 0 x += 1 continue return dataNewtrain = Bach_drop_outliers(train)
feature SaleID deleted number is 0feature name deleted number is 0feature regDate deleted number is 0feature model deleted number is 9720feature brand deleted number is 4032feature bodyType deleted number is 5458feature fuelType deleted number is 333feature gearbox deleted number is 26829feature power deleted number is 1506feature kilometer deleted number is 15306feature regionCode deleted number is 4feature seller deleted number is 1feature offerType deleted number is 0feature creatDate deleted number is 13989feature price deleted number is 4527feature v_0 deleted number is 2558feature v_1 deleted number is 0feature v_2 deleted number is 487feature v_3 deleted number is 173feature v_4 deleted number is 61feature v_5 deleted number is 0feature v_6 deleted number is 0feature v_7 deleted number is 64feature v_8 deleted number is 0feature v_9 deleted number is 24feature v_10 deleted number is 0feature v_11 deleted number is 0feature v_12 deleted number is 4feature v_13 deleted number is 0feature v_14 deleted number is 1944notRepairedDamagev_14
可以看出,经过箱型图异常值删除后,新数据的箱型图里几乎没有异常值了,甚至有些箱型图的数据是一条直线,这是因为这些特征本身就是非 0 即 1 的类别变量。
训练集和测试集放在一起,方便构造特征
train['train'] = 1
test['train'] = 0
data = pd.concat([train, test], ignore_index=True, sort=False)
data
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | train | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 | 1 |
1 | 5 | 137642 | 20090602 | 24.0 | 10 | 0.0 | 1.0 | 0.0 | 109 | 10.0 | ... | 0.000518 | 0.119838 | 0.090922 | 0.048769 | 1.885526 | -2.721943 | 2.457660 | -0.286973 | 0.206573 | 1 |
2 | 7 | 165346 | 19990706 | 26.0 | 14 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | ... | 0.000000 | 0.122943 | 0.039839 | 0.082413 | 3.693829 | -0.245014 | -2.192810 | 0.236728 | 0.195567 | 1 |
3 | 10 | 18961 | 20050811 | 19.0 | 9 | 3.0 | 1.0 | 0.0 | 101 | 15.0 | ... | 0.105385 | 0.077271 | 0.042445 | 0.060794 | -4.206000 | 1.060391 | -0.647515 | -0.191194 | 0.349187 | 1 |
4 | 13 | 8129 | 20041110 | 65.0 | 1 | 0.0 | 0.0 | 0.0 | 150 | 15.0 | ... | 0.106950 | 0.134945 | 0.050364 | 0.051359 | -4.614692 | 0.821889 | 0.753490 | -0.886425 | -0.341562 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112975 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | ... | 0.130044 | 0.049833 | 0.028807 | 0.004616 | -5.978511 | 1.303174 | -1.207191 | -1.981240 | -0.357695 | 0 |
112976 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0.108095 | 0.066039 | 0.025468 | 0.025971 | -3.913825 | 1.759524 | -2.075658 | -1.154847 | 0.169073 | 0 |
112977 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | ... | 0.105724 | 0.117652 | 0.057479 | 0.015669 | -4.639065 | 0.654713 | 1.137756 | -1.390531 | 0.254420 | 0 |
112978 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | ... | 0.000490 | 0.137366 | 0.086216 | 0.051383 | 1.833504 | -2.828687 | 2.465630 | -0.911682 | -2.057353 | 0 |
112979 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | ... | 0.000300 | 0.103534 | 0.080625 | 0.124264 | 2.914571 | -1.135270 | 0.547628 | 2.094057 | -1.552150 | 0 |
112980 rows × 32 columns
`data['creatDate'] - data['regDate']` 反映汽车使用时间,一般来说价格与使用时间成反比。

data['used_time'] = (pd.to_datetime(data['creatDate'], format='%Y%m%d', errors='coerce') -
                     pd.to_datetime(data['regDate'], format='%Y%m%d', errors='coerce')).dt.days
data['used_time']
0         4757.0
1         2482.0
2         6108.0
3         3874.0
4         4154.0
           ...
112975    7261.0
112976    6014.0
112977    4345.0
112978       NaN
112979    4151.0
Name: used_time, Length: 112980, dtype: float64
data['used_time'].isnull().sum()
8591
data.isnull().sum().sum()
70585
data['city'] = data['regionCode'].apply(lambda x: str(x)[:-3])
data['city']
0         4
1         3
2         4
3         1
4         3
         ..
112975    3
112976    1
112977    3
112978    1
112979    3
Name: city, Length: 112980, dtype: object
计算某品牌的销售统计量,这里要以 train 的数据计算统计量(测试集没有 price,也避免把标签信息泄露进特征)。
train_gb = train.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    kind_data = kind_data[kind_data['price'] > 0]  # kind_data['price'] > 0 返回的是布尔下标,再取一次就得到了对应数据
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
data = data.merge(brand_fe, how='left', on='brand')
data
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | train | used_time | city | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 1 | 4757.0 | 4 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
1 | 5 | 137642 | 20090602 | 24.0 | 10 | 0.0 | 1.0 | 0.0 | 109 | 10.0 | ... | 1 | 2482.0 | 3 | 3557.0 | 9500.0 | 2490.0 | 200.0 | 10936962.0 | 2180.881827 | 3073.91 |
2 | 7 | 165346 | 19990706 | 26.0 | 14 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | ... | 1 | 6108.0 | 4 | 8784.0 | 9500.0 | 1350.0 | 13.0 | 17445064.0 | 1797.704405 | 1985.78 |
3 | 10 | 18961 | 20050811 | 19.0 | 9 | 3.0 | 1.0 | 0.0 | 101 | 15.0 | ... | 1 | 3874.0 | 1 | 4487.0 | 9500.0 | 1250.0 | 55.0 | 7867901.0 | 1556.621159 | 1753.10 |
4 | 13 | 8129 | 20041110 | 65.0 | 1 | 0.0 | 0.0 | 0.0 | 150 | 15.0 | ... | 1 | 4154.0 | 3 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112975 | 199995 | 20903 | 19960503 | 4.0 | 4 | 4.0 | 0.0 | 0.0 | 116 | 15.0 | ... | 0 | 7261.0 | 3 | 6368.0 | 9500.0 | 3000.0 | 150.0 | 24046576.0 | 2558.650243 | 3775.57 |
112976 | 199996 | 708 | 19991011 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 75 | 15.0 | ... | 0 | 6014.0 | 1 | 16371.0 | 9500.0 | 2150.0 | 50.0 | 46735356.0 | 2276.755156 | 2854.59 |
112977 | 199997 | 6693 | 20040412 | 49.0 | 1 | 0.0 | 1.0 | 1.0 | 224 | 15.0 | ... | 0 | 4345.0 | 3 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
112978 | 199998 | 96900 | 20020008 | 27.0 | 1 | 0.0 | 0.0 | 1.0 | 334 | 15.0 | ... | 0 | NaN | 1 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
112979 | 199999 | 193384 | 20041109 | 166.0 | 6 | 1.0 | NaN | 1.0 | 68 | 9.0 | ... | 0 | 4151.0 | 3 | 5778.0 | 9500.0 | 1400.0 | 50.0 | 11955982.0 | 1871.933447 | 2068.87 |
112980 rows × 41 columns
brand_fe
brand | brand_amount | brand_price_max | brand_price_median | brand_price_min | brand_price_sum | brand_price_std | brand_price_average | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 16371.0 | 9500.0 | 2150.0 | 50.0 | 46735356.0 | 2276.755156 | 2854.59 |
1 | 1 | 4940.0 | 9500.0 | 2999.0 | 149.0 | 17934852.0 | 2537.956443 | 3629.80 |
2 | 3 | 665.0 | 9500.0 | 2800.0 | 99.0 | 2158773.0 | 2058.532395 | 3241.40 |
3 | 4 | 6368.0 | 9500.0 | 3000.0 | 150.0 | 24046576.0 | 2558.650243 | 3775.57 |
4 | 5 | 2842.0 | 9500.0 | 1850.0 | 75.0 | 6562224.0 | 1738.415572 | 2308.20 |
5 | 6 | 5778.0 | 9500.0 | 1400.0 | 50.0 | 11955982.0 | 1871.933447 | 2068.87 |
6 | 7 | 1035.0 | 9500.0 | 1500.0 | 100.0 | 2372550.0 | 2071.320262 | 2290.11 |
7 | 8 | 705.0 | 9500.0 | 1100.0 | 125.0 | 1077211.0 | 1318.748474 | 1525.79 |
8 | 9 | 4487.0 | 9500.0 | 1250.0 | 55.0 | 7867901.0 | 1556.621159 | 1753.10 |
9 | 10 | 3557.0 | 9500.0 | 2490.0 | 200.0 | 10936962.0 | 2180.881827 | 3073.91 |
10 | 11 | 1390.0 | 9500.0 | 1750.0 | 50.0 | 3513591.0 | 2151.572044 | 2525.95 |
11 | 12 | 549.0 | 9500.0 | 1850.0 | 100.0 | 1413264.0 | 2091.218447 | 2569.57 |
12 | 13 | 1689.0 | 8950.0 | 1250.0 | 25.0 | 2832005.0 | 1363.018568 | 1675.74 |
13 | 14 | 8784.0 | 9500.0 | 1350.0 | 13.0 | 17445064.0 | 1797.704405 | 1985.78 |
14 | 15 | 389.0 | 9500.0 | 5700.0 | 1800.0 | 2247357.0 | 1795.404288 | 5762.45 |
15 | 16 | 291.0 | 8900.0 | 1950.0 | 300.0 | 636703.0 | 1223.490908 | 2180.49 |
16 | 17 | 542.0 | 9500.0 | 1970.0 | 150.0 | 1444129.0 | 2136.402905 | 2659.54 |
17 | 18 | 66.0 | 8990.0 | 1650.0 | 150.0 | 167360.0 | 2514.210817 | 2497.91 |
18 | 19 | 341.0 | 9100.0 | 1200.0 | 130.0 | 540335.0 | 1337.203100 | 1579.93 |
19 | 20 | 514.0 | 8150.0 | 1200.0 | 100.0 | 818973.0 | 1276.623577 | 1590.24 |
20 | 21 | 527.0 | 8900.0 | 1890.0 | 99.0 | 1285258.0 | 1832.524896 | 2434.20 |
21 | 22 | 222.0 | 9300.0 | 1925.0 | 190.0 | 592296.0 | 2118.280894 | 2656.04 |
22 | 23 | 68.0 | 9500.0 | 1194.5 | 100.0 | 110253.0 | 1754.883573 | 1597.87 |
23 | 24 | 4.0 | 8600.0 | 7550.0 | 5999.0 | 29699.0 | 1072.435041 | 5939.80 |
24 | 25 | 735.0 | 9500.0 | 1500.0 | 100.0 | 1725999.0 | 2152.726491 | 2345.11 |
25 | 26 | 121.0 | 9500.0 | 2699.0 | 300.0 | 417260.0 | 2563.586943 | 3420.16 |
数据分箱(也称为离散分箱或分段)是一种数据预处理技术,用于减少次要观察误差的影响,是一种将多个连续值分组为较少数量的“分箱”的方法。例如我们有各个年龄的数据的统计值,可以分成某个段的年龄的值。
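拿上面提到的年龄的例子做一个小演示(数据和分箱边界都是随手设的,仅说明 `pd.cut` 的效果):

import pandas as pd

ages = pd.Series([3, 17, 25, 36, 59, 82])
bins = [0, 18, 40, 65, 100]                       # 分箱边界
print(pd.cut(ages, bins, labels=False))           # 0 0 1 1 2 3:每个年龄落入的箱编号
print(pd.cut(ages, bins, labels=['少年', '青年', '中年', '老年']))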
下面以 `power` 为例子,做一次数据分桶:
bin = [i*10 for i in range(31)]
data['power_bin'] = pd.cut(data['power'], bin, labels=False)
data[['power_bin', 'power']]
power_bin | power | |
---|---|---|
0 | NaN | 0 |
1 | 10.0 | 109 |
2 | 10.0 | 101 |
3 | 10.0 | 101 |
4 | 14.0 | 150 |
... | ... | ... |
112975 | 11.0 | 116 |
112976 | 7.0 | 75 |
112977 | 22.0 | 224 |
112978 | NaN | 334 |
112979 | 6.0 | 68 |
112980 rows × 2 columns
可以看出这个分箱的作用就是将同一个区间段的功率值设为同样的值,比如101~109都设置为10.0。
然后就可以删除掉原数据了:
data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)
print(data.shape)
data.columns
(112980, 39)

Index(['SaleID', 'name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
       'power', 'kilometer', 'notRepairedDamage', 'seller', 'offerType',
       'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8',
       'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'train', 'used_time',
       'city', 'brand_amount', 'brand_price_max', 'brand_price_median',
       'brand_price_min', 'brand_price_sum', 'brand_price_std',
       'brand_price_average', 'power_bin'],
      dtype='object')
至此,可以导出给树模型用的数据
data.to_csv('data_for_tree.csv', index=0)
上面的步骤就是一次比较完备的特征构造,我们还可以为其他模型构造特征,主要是由于不同模型需要的数据输入是不同的。
观察一下数据分布
data['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108b6377b8>
再看看 `train` 数据集的分布:
train['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108b4ed588>
我们对其取 log,再做归一化
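这里做的归一化就是最常见的 min-max 缩放,先取对数只是为了把长尾分布压平一些,缩放公式为:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$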
data['power'] = np.log(data['power'] + 1)
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108abc1438>
看看行驶里程的情况,应该是原始数据已经分好了桶
data['kilometer'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108abc1390>
归一化
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) / (np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2108aca0898>
对刚刚构造的统计量进行归一化
def max_min(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

data.columns[-10:]
Index(['used_time', 'city', 'brand_amount', 'brand_price_max', 'brand_price_median', 'brand_price_min', 'brand_price_sum', 'brand_price_std', 'brand_price_average', 'power_bin'], dtype='object')
for i in data.columns[-10:]:
    if np.min(data[i]) != '':   # city 列是字符串,可能取到空字符串,这类无法直接做数值归一化的列跳过
        data[i] = max_min(data[i])
对类别特征进行 OneHot 编码

在此之前先介绍一下 one-hot 编码:one-hot 的基本思想是,将离散型特征的每一种取值都看成一种状态,若这一特征中有 $N$ 个不相同的取值,那么我们就可以将该特征抽象成 $N$ 种不同的状态。one-hot 编码保证了每一个取值只会使得一种状态处于"激活态",也就是说这 $N$ 种状态中只有一个状态位的值为 1,其他状态位都是 0。举个例子,假设我们研究的类别为小学、中学、大学、硕士、博士五种学历,用 one-hot 对其编码,每个类别就对应一个长度为 5、恰好有一位是 1 的向量。

哑变量编码直观的解释就是任意地将一个状态位去除。还是拿上面的例子来说,我们用 4 个状态位就足够反映上述 5 个类别的信息,也就是仅用四个状态位 [0,0,0,0] 就可以表达博士了:因为对于一个样本,他既不是小学生、也不是中学生、不是大学生、又不是硕士,那么我们就可以默认他是博士。所以,用哑变量编码可以把上述 5 类用 4 位表示出来,两种编码的对比见下面的小例子。
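下面用一小段 pandas 代码把 one-hot 编码和哑变量编码的结果直接打印出来对比(类别只是示意,与赛题数据无关):

import pandas as pd

edu = pd.Series(['小学', '中学', '大学', '硕士', '博士'], name='学历')
print(pd.get_dummies(edu))                   # one-hot:5 个类别对应 5 个 0/1 状态位
print(pd.get_dummies(edu, drop_first=True))  # 哑变量:去掉一个状态位,用 4 位表示 5 类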
data = pd.get_dummies(data, columns=['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'power_bin'])
data
SaleID | name | power | kilometer | seller | offerType | price | v_0 | v_1 | v_2 | ... | power_bin_0.6896551724137931 | power_bin_0.7241379310344828 | power_bin_0.7586206896551724 | power_bin_0.7931034482758621 | power_bin_0.8275862068965517 | power_bin_0.8620689655172413 | power_bin_0.896551724137931 | power_bin_0.9310344827586207 | power_bin_0.9655172413793104 | power_bin_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 0.000000 | 1.000000 | 0 | 0 | 3600.0 | 45.305273 | 5.236112 | 0.137925 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5 | 137642 | 0.474626 | 0.655172 | 0 | 0 | 8000.0 | 46.323165 | -3.229285 | 0.156615 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 7 | 165346 | 0.467002 | 1.000000 | 0 | 0 | 1000.0 | 42.255586 | -3.167771 | -0.676693 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 10 | 18961 | 0.467002 | 1.000000 | 0 | 0 | 3100.0 | 45.401241 | 4.195311 | -0.370513 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 13 | 8129 | 0.506615 | 1.000000 | 0 | 0 | 3100.0 | 46.844574 | 4.175332 | 0.490609 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112975 | 199995 | 20903 | 0.480856 | 1.000000 | 0 | 0 | NaN | 45.621391 | 5.958453 | -0.918571 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112976 | 199996 | 708 | 0.437292 | 1.000000 | 0 | 0 | NaN | 43.935162 | 4.476841 | -0.841710 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112977 | 199997 | 6693 | 0.546885 | 1.000000 | 0 | 0 | NaN | 46.537137 | 4.170806 | 0.388595 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112978 | 199998 | 96900 | 0.587076 | 1.000000 | 0 | 0 | NaN | 46.771359 | -3.296814 | 0.243566 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112979 | 199999 | 193384 | 0.427535 | 0.586207 | 0 | 0 | NaN | 43.731010 | -3.121867 | 0.027348 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112980 rows × 369 columns
data.to_csv('data_for_lr.csv', index=0)
相关性分析
print(data['power'].corr(data['price'], method='spearman'))
print(data['kilometer'].corr(data['price'], method='spearman'))
print(data['brand_amount'].corr(data['price'], method='spearman'))
print(data['brand_price_average'].corr(data['price'], method='spearman'))
print(data['brand_price_max'].corr(data['price'], method='spearman'))
print(data['brand_price_median'].corr(data['price'], method='spearman'))
0.4698539569820024
-0.19974282513118508
0.04085800320025127
0.3135239590412946
0.07894119089254827
0.3138873049004745
可以看出 `power`、`brand_price_average`、`brand_price_median` 与 `price` 相关性比较高。
data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average', 'brand_price_max', 'brand_price_median']]
correlation = data_numeric.corr()
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=30)
sns.heatmap(correlation, square=True, cmap='PuBuGn', vmax=0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x21096f60198>
看不出来啥。😛
!pip install mlxtend
Collecting mlxtend Downloading https://files.pythonhosted.org/packages/64/e2/1610a86284029abcad0ac9bc86cb19f9787fe6448ede467188b2a5121bb4/mlxtend-0.17.2-py2.py3-none-any.whl (1.3MB)Requirement already satisfied: setuptools in d:\software\anaconda\lib\site-packages (from mlxtend) (40.8.0)Requirement already satisfied: pandas>=0.24.2 in d:\software\anaconda\lib\site-packages (from mlxtend) (0.25.1)Requirement already satisfied: scipy>=1.2.1 in d:\software\anaconda\lib\site-packages (from mlxtend) (1.2.1)Requirement already satisfied: matplotlib>=3.0.0 in d:\software\anaconda\lib\site-packages (from mlxtend) (3.0.3)Requirement already satisfied: numpy>=1.16.2 in d:\software\anaconda\lib\site-packages (from mlxtend) (1.16.2)Collecting joblib>=0.13.2 (from mlxtend) Downloading https://files.pythonhosted.org/packages/28/5c/cf6a2b65a321c4a209efcdf64c2689efae2cb62661f8f6f4bb28547cf1bf/joblib-0.14.1-py2.py3-none-any.whl (294kB)Requirement already satisfied: scikit-learn>=0.20.3 in d:\software\anaconda\lib\site-packages (from mlxtend) (0.20.3)Requirement already satisfied: pytz>=2017.2 in d:\software\anaconda\lib\site-packages (from pandas>=0.24.2->mlxtend) (2018.9)Requirement already satisfied: python-dateutil>=2.6.1 in d:\software\anaconda\lib\site-packages (from pandas>=0.24.2->mlxtend) (2.8.0)Requirement already satisfied: cycler>=0.10 in d:\software\anaconda\lib\site-packages (from matplotlib>=3.0.0->mlxtend) (0.10.0)Requirement already satisfied: kiwisolver>=1.0.1 in d:\software\anaconda\lib\site-packages (from matplotlib>=3.0.0->mlxtend) (1.0.1)Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in d:\software\anaconda\lib\site-packages (from matplotlib>=3.0.0->mlxtend) (2.3.1)Requirement already satisfied: six>=1.5 in d:\software\anaconda\lib\site-packages (from python-dateutil>=2.6.1->pandas>=0.24.2->mlxtend) (1.12.0)Installing collected packages: joblib, mlxtendSuccessfully installed joblib-0.14.1 mlxtend-0.17.2
x
SaleID | name | power | kilometer | seller | offerType | v_0 | v_1 | v_2 | v_3 | ... | power_bin_0.6896551724137931 | power_bin_0.7241379310344828 | power_bin_0.7586206896551724 | power_bin_0.7931034482758621 | power_bin_0.8275862068965517 | power_bin_0.8620689655172413 | power_bin_0.896551724137931 | power_bin_0.9310344827586207 | power_bin_0.9655172413793104 | power_bin_1.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2262 | 0.000000 | 1.000000 | 0 | 0 | 45.305273 | 5.236112 | 0.137925 | 1.380657 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 5 | 137642 | 0.474626 | 0.655172 | 0 | 0 | 46.323165 | -3.229285 | 0.156615 | -1.727217 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 7 | 165346 | 0.467002 | 1.000000 | 0 | 0 | 42.255586 | -3.167771 | -0.676693 | 1.942673 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 10 | 18961 | 0.467002 | 1.000000 | 0 | 0 | 45.401241 | 4.195311 | -0.370513 | 0.444251 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 13 | 8129 | 0.506615 | 1.000000 | 0 | 0 | 46.844574 | 4.175332 | 0.490609 | 0.085718 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112975 | 199995 | 20903 | 0.480856 | 1.000000 | 0 | 0 | 45.621391 | 5.958453 | -0.918571 | 0.774826 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112976 | 199996 | 708 | 0.437292 | 1.000000 | 0 | 0 | 43.935162 | 4.476841 | -0.841710 | 1.328253 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112977 | 199997 | 6693 | 0.546885 | 1.000000 | 0 | 0 | 46.537137 | 4.170806 | 0.388595 | -0.704689 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112978 | 199998 | 96900 | 0.587076 | 1.000000 | 0 | 0 | 46.771359 | -3.296814 | 0.243566 | -1.277411 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112979 | 199999 | 193384 | 0.427535 | 0.586207 | 0 | 0 | 43.731010 | -3.121867 | 0.027348 | -0.808914 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
112980 rows × 368 columns
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

sfs = SFS(LinearRegression(), k_features=10, forward=True, floating=False, scoring='r2', cv=0)
x = data.drop(['price'], axis=1)
x = x.fillna(0)
y = data['price']
x.dropna(axis=0, how='any', inplace=True)
y.dropna(axis=0, how='any', inplace=True)
sfs.fit(x, y)
sfs.k_feature_names_
Plotting it shows the marginal gain from each additional feature.
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt

fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()
Lasso regression and decision trees can perform embedded feature selection, and in most cases embedded methods are what people reach for when screening features.
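The embedded approach is not shown in this post, but a minimal sketch with scikit-learn's SelectFromModel wrapped around Lasso (assuming the same x and y as above; the alpha value is an arbitrary placeholder) could look like this:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The L1 penalty drives uninformative coefficients toward zero,
# so the features whose coefficients survive are the "embedded" selection.
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectFromModel(Lasso(alpha=0.1))),
])
x_selected = pipeline.fit_transform(x, y)
kept = x.columns[pipeline.named_steps['select'].get_support()]
print(len(kept), 'features kept:', list(kept)[:10])
```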
The next step is modeling. 🤔
My Hexo version is 3.9.0 and my Next theme is version 7.5, the magical, epoch-making release that removed the custom file.
My site is not hosted on GitHub Pages but on an Alibaba Cloud ECS server; for how to migrate a Hexo blog from GitHub Pages to an Alibaba Cloud ECS server, see this post: 将博客部署到阿里云服务器上.
At first I found this version rather bare and wanted to tweak the styling. I don't know CSS, but the web is full of articles on editing the custom file, and copy-pasting from them used to be enough to get a good-looking theme. Now that the file has been removed, all of that stops working.
Or so it seems. After a long slog through Next's pitfalls, I found that you can append one line at the end of themes/next/source/css/main.styl:
@import "_custom/custom";
Then create a _custom folder under themes/next/source/css, create a custom.styl file inside it, and paste in whatever Next styling snippets you have collected from the web; the newly added styles will now show up in the local preview.
🙆 You may not even need to go to that much trouble: CSS added to any .styl file under themes/next/source/css will also take effect in the local preview.
I think the reason is that the theme's stylesheet entry point is themes/next/source/css/main.styl, which consists entirely of import statements pulling in all the other style files. The generated main.css is therefore just the statements from all those partial files concatenated in order, so whichever file you add your rules to only changes where they land in the final main.css; the effect still applies. For easier maintenance later, though, it is better to add them where they logically belong.
As hinted above, "takes effect in the local preview" is the operative phrase. You can write your own CSS or paste in snippets from the web, but here is the headache: in most cases the changes only work locally. Once you deploy to the server they are either ignored entirely or render oddly; sometimes even deleting the whole /css folder leaves the deployed site's styling completely unchanged. 😢
I then opened the F12 dev tools on both the deployed page and the local preview, compared them item by item, and finally spotted the difference. In the local preview, the stylesheet under /css is named main.css, exactly what hexo g generates in the /public folder. On the deployed page, however, the file name becomes main.css?v=7.3.0, and that extra ?v=7.3.0 had me stumped. The contents of main.css match my local /public/css/main.css exactly, but main.css?v=7.3.0 is inexplicably missing a few hundred lines. I first assumed hexo deploy had uploaded an incomplete file, but the copy on the Alibaba Cloud server is identical to my local one. And when I pasted the contents of main.css into main.css?v=7.3.0 in the dev tools, the deployed page finally looked like my local preview, although a refresh undid it, since it was only an in-browser edit.
So the problem is clear: main.css?v=7.3.0 does not live on the server, where it comes from is not obvious, and it cannot be edited directly, so the deployed page has to be made to load main.css instead of main.css?v=7.3.0.
Once you know where the problem is, the fix is simple. After some digging I found this line in themes/next/layout/_partials/head/head.swig:
<link rel="stylesheet" href="{{ url_for(theme.css) }}/main.css?v={{ version }}">
Clearly the extra ?v=7.3.0 comes from the ?v={{ version }} here, so just delete ?v={{ version }} and you're done.
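For clarity, the resulting one-line change in head.swig amounts to this (before/after):

```diff
- <link rel="stylesheet" href="{{ url_for(theme.css) }}/main.css?v={{ version }}">
+ <link rel="stylesheet" href="{{ url_for(theme.css) }}/main.css">
```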
Then run hexo clean && hexo g && hexo d again, check the deployed page, and the styles are all there. Problem solved. 👍👍👍
This fixes the problem itself, and the local preview and the deployed site will no longer diverge, but it comes with a side effect.
Just when I thought I could finally tinker with impunity, a new problem appeared. I changed the CSS again, adding some styles to the /_custom/custom file, but after another hexo d nothing changed. Debugging again showed why: main.css itself had not changed. My guess is that the cause is the same as with main.css?v=7.3.0: the deployed main.css is effectively frozen, and even deploying a new CSS file with hexo d does not refresh it.
The workaround is to change the following line in themes/next/layout/_partials/head/head.swig:
- <link rel="stylesheet" href="{{ url_for(theme.css) }}/main.css">
+ <link rel="stylesheet" href="{{ url_for(theme.css) }}/main1.css">
That is, rename it to something else, main1 or main2, anything different from the original. Running hexo g will then make the generated post pages reference main1.css instead.
You also need to rename main.css in /public/css to main1.css; after a final hexo d, the modified CSS styles take effect.
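Put into shell commands, the workflow looks roughly like this (a sketch; paths assume the default Hexo project layout and the main1.css name chosen above):

```bash
hexo clean && hexo g                           # regenerate the site into /public
mv public/css/main.css public/css/main1.css    # match the name now referenced in head.swig
hexo d                                         # deploy the renamed stylesheet
```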
Of course, this also means that every future CSS change requires renaming main.css again, to main2.css, main3.css, and so on.
Before running hexo clean && hexo g && hexo d, back up the /public folder first so you can roll back if needed.
So my advice is to either leave the CSS alone or make all the changes in one go. After all, what matters in a Hexo blog is the content, not the looks, isn't it? 😉
To recap:

- Delete the ?v={{ version }} suffix after main.css?v={{ version }} in themes/next/layout/_partials/head/head.swig.
- After changing the CSS and before hexo d, rename main.css in themes/next/layout/_partials/head/head.swig to a new name such as main2.css, main3.css, ..., and rename the generated file in /public/css to match.
- Before hexo clean && hexo g && hexo d, back up the /public folder so you can roll back if needed.