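Training
The simplest command for training a model is:
vw train_file --cache_file cache_train -f model_file
- train_file: the training data. For its format, see http://blog.csdn.net/zc02051126/article/details/47005229 or https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format
- --cache_file: sets the cache file. VW is slow at loading the plain-text train_file; with this option, the first run writes a binary cache file that VW reads much faster on later runs.
- -f: sets the file the trained model is written to.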
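The training command above can also be assembled and launched from a script. The following is a minimal sketch, assuming the `vw` executable is on the PATH; `build_train_command` and `run` are illustrative helper names, not part of VW.

```python
import shutil
import subprocess

def build_train_command(train_file, model_file, cache_file=None):
    """Return the argument list for the basic training command."""
    cmd = ["vw", train_file]
    if cache_file is not None:
        cmd += ["--cache_file", cache_file]
    cmd += ["-f", model_file]
    return cmd

def run(cmd):
    """Run the command, or just print it when vw is not installed."""
    if shutil.which(cmd[0]) is None:
        print("vw not found; would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)

# Usage (file names are the placeholders from the command above):
# run(build_train_command("train_file", "model_file", "cache_train"))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.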
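Prediction
vw -t --cache_file cache_test -i model_file -p result.txt test_file
- -t: test only. Labels in the data are ignored and no learning takes place; VW only makes predictions for the examples.
- --cache_file: the cache file, as above.
- -i: sets the model used to predict on unseen data.
- -p: sets the file the predictions are written to.
- test_file: the data file containing the examples to predict.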
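Once the prediction command has written result.txt, the predictions can be loaded for further processing. This is a minimal sketch, assuming each line of the file starts with the numeric prediction, optionally followed by whitespace and the example's tag; `read_predictions` is an illustrative helper name.

```python
def read_predictions(path):
    """Return the leading numeric value of every non-empty line in a predictions file."""
    predictions = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                predictions.append(float(fields[0]))
    return predictions

# Usage: predictions = read_predictions("result.txt")
```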
Other options in detail
1 VW options
- -h [ --help ]: show the help message.
- --version: show version information.
- --random_seed arg: seed for the random number generator.
- --noop: do no learning.
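2 Input options
- -d [ --data ] arg: sets the example data file.
- --ring_size arg: size of the example ring.
- --examples arg: number of examples to parse.
- --daemon: read data from a port (26542 by default).
- --port arg: the port to listen on.
- --num_children arg (=10): number of children for persistent daemon mode.
- --pid_file arg: write a pid file in persistent daemon mode.
- --passes arg (=1): number of training passes over the data; one pass if not set.
- -c [ --cache ]: use a cache. By default the cache file is the data file name with .cache appended.
- --cache_file arg: sets the cache file.
- --compressed: use gzip format for compressed files. If a cache file is created, it is stored compressed. With auto-detection, a mixture of plain-text and compressed inputs is supported.
- --no_stdin: do not default to reading from stdin.
- --save_resume: save extra state so learning can be resumed later with new data.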
Raw training/testing data (in the proper plain text input format) can be passed to VW in a number of ways:
- Using the -d or --data options, which expect a file name as an argument (specifying a file name that is not associated with any option also works);
- Via stdin;
- Via a TCP/IP port if the --daemon option is specified. The port itself is specified by --port; otherwise the default port 26542 is used. By default the daemon creates 10 child processes which share the model state, allowing it to answer multiple simultaneous queries. The number of child processes can be controlled with --num_children, and you can create a file with the jobid using --pid_file, which is later useful for killing the job.
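A client for the daemon mode described above can be sketched with a plain TCP socket. This is a minimal sketch under two assumptions: a daemon has already been started separately (for example with -i model_file -t --daemon), and it answers each example line with one line of text containing the prediction. `query_vw_daemon` is an illustrative helper name.

```python
import socket

def query_vw_daemon(example_line, host="localhost", port=26542, timeout=5.0):
    """Send one example line to a running VW daemon and return its reply line."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # The daemon reads newline-terminated examples in the plain text input format.
        conn.sendall(example_line.rstrip("\n").encode("utf-8") + b"\n")
        with conn.makefile("r", encoding="utf-8") as reader:
            reply = reader.readline()
    return reply.strip()

# Usage (the example line must follow the VW input format):
# print(query_vw_daemon("| feature_a feature_b"))
```

Opening one connection per query keeps the sketch short; a client that sends many examples would normally keep the connection open and reuse it.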
Parsing raw data is slow so there are options to create or load data in VW’s native format. Files containing data in VW’s native format are called caches. The exact contents of a cache file depend on the input as well as a few options (-b, --affix, --spelling) that are passed to VW during the creation of the cache. This implies that using the cache file with different options might cause VW to rebuild the cache. The easiest way to use a cache is to always specify the -c option. This way, VW will first look for a cache file and create it if it doesn’t exist. To override the default cache file name use --cache_file followed by the file name.
--compressed can be used for reading gzipped raw training data, writing gzipped caches, and reading gzipped caches.
--passes takes as an argument the number of times the algorithm will cycle over the data (epochs).
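The cache, compression, and pass options are often used together. The sketch below writes example lines to a gzip file from Python and assembles a multi-pass training command that reads it with --compressed and the default cache (-c). The helper names are illustrative, and the example lines are supplied by the caller in the VW input format.

```python
import gzip

def write_gzipped_examples(lines, path):
    """Write plain-text example lines to a gzip file, one example per line."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line.rstrip("\n") + "\n")

def multipass_command(data_file, model_file, passes):
    """Argument list: gzipped input, default cache file (-c), several passes."""
    return ["vw", "-d", data_file, "--compressed", "-c",
            "--passes", str(passes), "-f", model_file]

# Usage (file names are hypothetical):
# write_gzipped_examples(example_lines, "train.vw.gz")
# cmd = multipass_command("train.vw.gz", "model_file", 3)
```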
6 Weight options
- -b [ --bit_precision ] arg: number of bits in the feature table.
- -i [ --initial_regressor ] arg: initial regressor(s) to load into memory (arg is a file name).
- -f [ --final_regressor ] arg: sets the file the model is saved to.
- --random_weights arg: make the initial weights random.
- --initial_weight arg (=0): set all weights to the initial value arg (0 by default).
- --readable_model arg: write the model in a readable text form (feature hashes and weights).
- --invert_hash arg: write the model in a readable text form that includes the feature names.
- --save_per_pass: save the model after every pass over the data.
- --input_feature_regularizer arg: per-feature regularization input file.
- --output_feature_regularizer_binary arg: per-feature regularization output file.
- --output_feature_regularizer_text arg: per-feature regularization output file, in text.
VW hashes all features to a predetermined range [0,2^b-1] and uses a fixed weight vector with 2^b components. The argument of the -b option determines the value of b, which is 18 by default. Hashing the features allows the algorithm to work with very raw data (since there’s no need to assign a unique id to each feature) and has only a negligible effect on generalization performance (see for example Feature Hashing for Large Scale Multitask Learning).
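The idea of hashing features into a fixed-size weight vector can be sketched in a few lines. Note the assumption: zlib.crc32 is used here only as a stand-in hash, so the indexes will not match the ones VW computes; the function names are illustrative.

```python
import zlib

def feature_index(name, bits=18):
    """Map a feature name to a slot in [0, 2**bits - 1].

    zlib.crc32 is a stand-in for illustration; it is NOT the hash VW uses.
    """
    mask = (1 << bits) - 1
    return zlib.crc32(name.encode("utf-8")) & mask

def linear_score(weights, feature_names, bits=18):
    """Sum the weights of the slots the feature names hash to."""
    return sum(weights[feature_index(name, bits)] for name in feature_names)

# Usage: a weight vector with 2**18 components, all starting at zero.
weights = [0.0] * (1 << 18)
weights[feature_index("price")] = 0.5
score = linear_score(weights, ["price"])
```

Two different names can land in the same slot; a larger b makes such collisions rarer at the cost of a larger weight vector.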
When training, use -f to name the model file. To retrain a model (that is, to continue training from an earlier result), use -i to load the existing model.
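--readable_model, like -f, names a file to save the model to, except that the model is written as readable text rather than binary. Each entry has the format
feature hash:feature weight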
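A file in the hash:weight format can be loaded back into a dictionary for inspection. This is a minimal sketch, assuming that any header lines in the file do not look like an integer followed by a colon and a number; `read_weights` is an illustrative helper name, and the sample lines in the usage comment are made up.

```python
import re

# An entry is an integer hash, a colon, and a weight with no spaces.
_WEIGHT_LINE = re.compile(r"^(\d+):(\S+)$")

def read_weights(lines):
    """Collect {feature hash: weight} from lines shaped like 'hash:weight'."""
    weights = {}
    for line in lines:
        match = _WEIGHT_LINE.match(line.strip())
        if not match:
            continue  # not an entry line (for example, header text)
        try:
            weights[int(match.group(1))] = float(match.group(2))
        except ValueError:
            continue  # the part after the colon was not a number
    return weights

# Usage (hypothetical file name):
# with open("model.txt", encoding="utf-8") as handle:
#     weights = read_weights(handle)
```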
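--invert_hash is similar to --readable_model, but its output is easier for people to read. Each entry has the format
feature name:feature hash:feature weight
that is, every feature name is followed by the feature's hash and then its weight. Note that --invert_hash needs more memory and makes the run slower. Feature names are not stored in cache files, so if -c is given, a cache file already exists, and you want to use --invert_hash, either delete the cache file or pass -k to have VW delete it automatically. For multi-pass learning, where -c is required, it is recommended to train first without --invert_hash and then run VW again with -t (no learning) and --invert_hash. That second run only reads the existing binary model (given by -i) and writes it out as text (to the file given by --invert_hash).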
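The recommended two-step procedure for multi-pass learning can be written down as two commands. The sketch below only assembles the argument lists; the helper name and the file names in the usage comment are hypothetical.

```python
def invert_hash_workflow(data_file, model_file, readable_file, passes=2):
    """Return the two commands: train with a cache, then dump the model as text."""
    # Step 1: multi-pass training with a cache and without --invert_hash.
    train = ["vw", "-d", data_file, "-c", "--passes", str(passes),
             "-f", model_file]
    # Step 2: no learning (-t); load the binary model and write the readable one.
    # The raw data is read again (no -c) so that feature names are available.
    dump = ["vw", "-d", data_file, "-t", "-i", model_file,
            "--invert_hash", readable_file]
    return [train, dump]

# Usage:
# for cmd in invert_hash_workflow("train.data", "my.model", "my.invert_hash"):
#     print(" ".join(cmd))
```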
--save_per_pass: if this option is set, the model is saved after every pass over the data.
--input_feature_regularizer, --output_feature_regularizer_binary, and --output_feature_regularizer_text are analogs of -i, -f, and --readable_model for batch optimization where you want to do per-feature regularization. This is advanced, but allows efficient simulation of online learning with a batch optimizer.
By default VW starts with the zero vector as its hypothesis. The --random_weights option initializes with random weights. This is often useful for symmetry breaking in advanced models. It’s also possible to initialize with a fixed value, such as the all-ones vector, using --initial_weight.