TensorFlow binary classification with dense or sparse (text classification) input data

Date: 2021-12-07 23:06:57

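I made a few small changes here, with thanks to Google's R&D engineers for their help, so that the same code can handle both dense data and sparse input such as text classification features. I plan to keep studying and optimizing it, for example by adding multi-threaded processing.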

Sparse input is handled mainly with embedding_lookup_sparse; see

https://github.com/tensorflow/tensorflow/issues/342
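To make the idea concrete, here is a small NumPy-only sketch (not TensorFlow code) of what a sum-combined sparse lookup computes: for each instance, the weighted sum of the rows of the weight matrix selected by its feature ids. The helper name sparse_lookup_sum and the toy values are made up for illustration. The result matches an ordinary dense matrix product, which is presumably why the lookup can stand in for matmul on sparse input.

```python
import numpy as np

def sparse_lookup_sum(w, ids_per_row, weights_per_row):
    """Hypothetical helper: for each row, return sum_k weight_k * w[id_k]."""
    out = np.zeros((len(ids_per_row), w.shape[1]))
    for r, (ids, weights) in enumerate(zip(ids_per_row, weights_per_row)):
        for feature_id, weight in zip(ids, weights):
            out[r] += weight * w[feature_id]
    return out

# Toy weight matrix: 5 features, 2 output units.
w = np.arange(10, dtype=float).reshape(5, 2)

# Two instances, each given as feature ids plus the matching feature values.
ids_per_row = [[0, 3], [1, 2, 4]]
weights_per_row = [[0.5, 2.0], [1.0, 1.0, 0.1]]

sparse_result = sparse_lookup_sum(w, ids_per_row, weights_per_row)

# The same instances written as a dense matrix give the same product.
dense_x = np.zeros((2, 5))
for r, (ids, weights) in enumerate(zip(ids_per_row, weights_per_row)):
    dense_x[r, ids] = weights
dense_result = dense_x.dot(w)

print(np.allclose(sparse_result, dense_result))
```

With 4,762,348 features, as in the run reported in this post, the dense matrix would be impractical to build, while the sparse form only touches the non-zero entries.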

There are two files:

melt.py

binary_classification.py

The code and data have been uploaded to https://github.com/chenghuige/tensorflow-example . For the sparse handling, sparse_tensor.py is a good place to start.
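As a rough guide to the sparse input format, the sketch below parses one line the way load_sparse_dataset in melt.py reads it: an optional instance name starting with '_', then the label, the total number of features, and id:value pairs. The helper name parse_sparse_line and the sample line are made up; the field layout is inferred from the loader code in this post, not from separate format documentation.

```python
def parse_sparse_line(line):
    """Hypothetical helper mirroring the field layout read by load_sparse_dataset."""
    tokens = line.rstrip().split()
    # A leading token starting with '_' is treated as an instance name and skipped.
    label_idx = 1 if line.startswith('_') else 0
    label = float(tokens[label_idx])
    num_features = int(tokens[label_idx + 1])
    ids, vals = [], []
    for item in tokens[label_idx + 2:]:
        feature_id, val = item.split(':')
        ids.append(int(feature_id))
        vals.append(float(val))
    return label, num_features, ids, vals

# Made-up example line, not taken from the real corpus.
label, num_features, ids, vals = parse_sparse_line("_doc1 1 4762348 12:0.5 907:1.25")
print(label, num_features, ids, vals)
```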

Run:

python ./binary_classification.py --tr corpus/feature.trate.0_2.normed.txt --te corpus/feature.trate.1_2.normed.txt --batch_size 200 --method mlp --num_epochs 1000

... loading dataset: corpus/feature.trate.0_2.normed.txt

0

10000

20000

30000

40000

50000

60000

70000

finish loading train set corpus/feature.trate.0_2.normed.txt

... loading dataset: corpus/feature.trate.1_2.normed.txt

0

10000

finish loading test set corpus/feature.trate.1_2.normed.txt

num_features: 4762348

trainSet size: 70968

testSet size: 17742

batch_size: 200 learning_rate: 0.001 num_epochs: 1000

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24

I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24

I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24

I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24

0 auc: 0.503701159392 cost: 0.69074464019

1 auc: 0.574863035489 cost: 0.600787888115

2 auc: 0.615858601208 cost: 0.60036152958

3 auc: 0.641573172518 cost: 0.599917832685

4 auc: 0.657326531323 cost: 0.599433459447

5 auc: 0.666575623414 cost: 0.598856064529

6 auc: 0.671990014639 cost: 0.598072590816

7 auc: 0.675956442936 cost: 0.596850153855

8 auc: 0.681129512174 cost: 0.594744671454

9 auc: 0.689568680575 cost: 0.591011970184

10 auc: 0.70265083004 cost: 0.584730529957

11 auc: 0.720751242654 cost: 0.575319047846

12 auc: 0.740525668112 cost: 0.563041782476

13 auc: 0.756397606412 cost: 0.548790696159

14 auc: 0.76745782664 cost: 0.533633556673

15 auc: 0.776115284883 cost: 0.518648754985

16 auc: 0.783683301767 cost: 0.504702218341

17 auc: 0.79058754946 cost: 0.492255532423

18 auc: 0.796831772334 cost: 0.481419827863

19 auc: 0.802349672543 cost: 0.472143309749

20 auc: 0.807102186144 cost: 0.464346827091

21 auc: 0.811092646634 cost: 0.457953127862

22 auc: 0.814318813594 cost: 0.452874061637

23 auc: 0.816884839449 cost: 0.449003176388

24 auc: 0.818881302313 cost: 0.446225956373

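Judging from these results, a simple MLP easily beats LinearSVM: by epoch 24 the MLP's test AUC is already 0.819, against 0.798 for LinearSVM on the same train/test split (melt run below).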

mlt feature.trate.0_2.normed.txt -c tt -test feature.trate.1_2.normed.txt --iter 1000000

I1130 20:03:36.485967 18502 Melt.h:59] _cmd.randSeed --- [4281910087]

I1130 20:03:36.486151 18502 Melt.h:1209] omp_get_num_procs() --- [24]

I1130 20:03:36.486706 18502 Melt.h:1221] get_num_threads() --- [22]

I1130 20:03:36.486742 18502 Melt.h:1224] commandStr --- [tt]

I1130 20:03:36.486760 18502 time_util.h:102] TrainTest! started

I1130 20:03:36.486789 18502 time_util.h:102] ParseInputDataFile started

I1130 20:03:36.785362 18502 time_util.h:113] ParseInputDataFile finished using: [298.557 ms] (0.298551 s)

I1130 20:03:36.785481 18502 TrainerFactory.cpp:99] Creating LinearSVM trainer

I1130 20:03:36.785524 18502 time_util.h:102] Train started

MinMaxNormalizer prepare [ 70968 ] (0.193283 s)100% |******************************************|

I1130 20:03:37.064959 18502 time_util.h:102] Normalize started

I1130 20:03:37.096940 18502 time_util.h:113] Normalize finished using: [31.945 ms] (0.031939 s)

LinearSVM training [ 1000000 ] (1.14643 s)100% |******************************************|

Sigmoid/PlattCalibrator calibrating [ 70968 ] (0.139669 s)100% |******************************************|

I1130 20:03:38.383231 18502 Trainer.h:65] Param: [numIterations:1000000 learningRate:0.001 trainerTyper:peagsos loopType:stochastic sampleSize:1 performProjection:0 ]

I1130 20:03:38.457448 18502 time_util.h:113] Train finished using: [1671.9 ms] (1.6719 s)

I1130 20:03:38.506352 18502 time_util.h:102] ParseInputDataFile started

I1130 20:03:38.579484 18502 time_util.h:113] ParseInputDataFile finished using: [73.094 ms] (0.073092 s)

I1130 20:03:38.579563 18502 Melt.h:603] Test feature.trate.1_2.normed.txt and writting instance predict file to ./result/0.inst.txt

TEST POSITIVE RATIO:        0.2876 (5103/(5103+12639))

Confusion table:
         ||===============================||
         ||           PREDICTED           ||
  TRUTH  ||    positive    |   negative   || RECALL
         ||===============================||
 positive||      3195      |     1908     || 0.6261 (3195/5103)
 negative||      2137      |    10502     || 0.8309 (10502/12639)
         ||===============================||
 PRECISION 0.5992 (3195/5332) 0.8463(10502/12410)

LOG-LOSS/instance:                0.4843
LOG-LOSS-PROB/instance:                0.6256
TEST-SET ENTROPY (prior LL/in):        0.6000
LOG-LOSS REDUCTION (RIG):        -4.2637%

OVERALL 0/1 ACCURACY:        0.7720 (13697/17742)
POS.PRECISION:                0.5992
POS.RECALL:                0.6261
NEG.PRECISION:                0.8463
NEG.RECALL:                0.8309
F1.SCORE:                 0.6124
OuputAUC: 0.7984
AUC: [0.7984]

----------------------------------------------------------------------------------------

I1130 20:03:38.729507 18502 time_util.h:113] TrainTest! finished using: [2242.72 ms] (2.24272 s)
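As a sanity check on the melt report, the summary metrics can be recomputed from the four confusion-table counts. This is only worked arithmetic on the numbers printed above; the variable names are my own.

```python
# Counts copied from the confusion table in the melt output.
tp, fn = 3195, 1908    # truth positive: predicted positive / predicted negative
fp, tn = 2137, 10502   # truth negative: predicted positive / predicted negative

pos_recall = tp / float(tp + fn)
pos_precision = tp / float(tp + fp)
neg_recall = tn / float(tn + fp)
neg_precision = tn / float(tn + fn)
accuracy = (tp + tn) / float(tp + fn + fp + tn)
# F1 is the harmonic mean of positive precision and positive recall.
f1 = 2 * pos_precision * pos_recall / (pos_precision + pos_recall)

print("POS.PRECISION %.4f  POS.RECALL %.4f" % (pos_precision, pos_recall))
print("NEG.PRECISION %.4f  NEG.RECALL %.4f" % (neg_precision, neg_recall))
print("ACCURACY %.4f  F1 %.4f" % (accuracy, f1))
```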

#---------------------melt.py

#!/usr/bin/env python
#coding=gbk
# ==============================================================================
# \file melt.py
# \author chenghuige
# \date 2015-11-30 13:40:19.506009
# \Description
# ==============================================================================
import numpy as np
import os

#---------------------------melt load data
# Supports the melt dense and sparse input file formats.
# Sparse input has no header; for dense input the header line is skipped.
# @TODO also support libsvm format
def guess_file_format(line):
  is_dense = True
  has_header = False
  if line.startswith('#'):
    # a leading '#' marks a dense file with a header line
    has_header = True
    return is_dense, has_header
  elif line.find(':') > 0:
    # id:value pairs mean the file is sparse
    is_dense = False
  return is_dense, has_header

def guess_label_index(line):
  # a first column starting with '_' is an instance name, so the label comes second
  label_idx = 0
  if line.startswith('_'):
    label_idx = 1
  return label_idx

#@TODO implement [a:b] so we can use [a:b] in application code
class Features(object):
  def __init__(self):
    self.data = []

  def mini_batch(self, start, end):
    return self.data[start: end]

  def full_batch(self):
    return self.data

class SparseFeatures(object):
  def __init__(self):
    self.sp_indices = []
    self.start_indices = [0]
    self.sp_ids_val = []
    self.sp_weights_val = []
    self.sp_shape = None

  def mini_batch(self, start, end):
    batch = SparseFeatures()
    start_ = self.start_indices[start]
    end_ = self.start_indices[end]
    batch.sp_ids_val = self.sp_ids_val[start_: end_]
    batch.sp_weights_val = self.sp_weights_val[start_: end_]
    row_idx = 0
    max_len = 0
    #@TODO better way to construct sp_indices for each mini batch ?
    for i in xrange(start + 1, end + 1):
      len_ = self.start_indices[i] - self.start_indices[i - 1]
      if len_ > max_len:
        max_len = len_
      for j in xrange(len_):
        batch.sp_indices.append([i - start - 1, j])
      row_idx += 1
    batch.sp_shape = [end - start, max_len]
    return batch

  def full_batch(self):
    if len(self.sp_indices) == 0:
      row_idx = 0
      max_len = 0
      for i in xrange(1, len(self.start_indices)):
        len_ = self.start_indices[i] - self.start_indices[i - 1]
        if len_ > max_len:
          max_len = len_
        for j in xrange(len_):
          self.sp_indices.append([i - 1, j])
        row_idx += 1
      self.sp_shape = [len(self.start_indices) - 1, max_len]
    return self

class DataSet(object):
  def __init__(self):
    self.labels = []
    self.features = None
    self.num_features = 0

  def num_instances(self):
    return len(self.labels)

  def full_batch(self):
    return self.features.full_batch(), self.labels

  def mini_batch(self, start, end):
    # a negative end counts back from the last instance
    if end < 0:
      end = self.num_instances() + end
    return self.features.mini_batch(start, end), self.labels[start: end]

def load_dense_dataset(lines):
  dataset_x = []
  dataset_y = []
  nrows = 0
  label_idx = guess_label_index(lines[0])
  for i in xrange(len(lines)):
    if nrows % 10000 == 0:
      print nrows
    nrows += 1
    line = lines[i]
    l = line.rstrip().split()
    dataset_y.append([float(l[label_idx])])
    dataset_x.append([float(x) for x in l[label_idx + 1:]])
  dataset_x = np.array(dataset_x)
  dataset_y = np.array(dataset_y)
  dataset = DataSet()
  dataset.labels = dataset_y
  dataset.num_features = dataset_x.shape[1]
  features = Features()
  features.data = dataset_x
  dataset.features = features
  return dataset

# sparse line layout: [_name] label num_features id:val id:val ...
def load_sparse_dataset(lines):
  dataset_x = []
  dataset_y = []
  label_idx = guess_label_index(lines[0])
  num_features = int(lines[0].split()[label_idx + 1])
  features = SparseFeatures()
  nrows = 0
  start_idx = 0
  for i in xrange(len(lines)):
    if nrows % 10000 == 0:
      print nrows
    nrows += 1
    line = lines[i]
    l = line.rstrip().split()
    dataset_y.append([float(l[label_idx])])
    # start_indices[i] is the offset of instance i in the flat id/value arrays
    start_idx += (len(l) - label_idx - 2)
    features.start_indices.append(start_idx)
    for item in l[label_idx + 2:]:
      id, val = item.split(':')
      features.sp_ids_val.append(int(id))
      features.sp_weights_val.append(float(val))
  dataset_y = np.array(dataset_y)
  dataset = DataSet()
  dataset.labels = dataset_y
  dataset.num_features = num_features
  dataset.features = features
  return dataset

def load_dataset(dataset, has_header=False):
  print '... loading dataset:',dataset
  lines = open(dataset).readlines()
  if has_header:
    return load_dense_dataset(lines[1:])
  is_dense, has_header = guess_file_format(lines[0])
  if is_dense:
    return load_dense_dataset(lines[has_header:])
  else:
    return load_sparse_dataset(lines)

#-----------------------------------------melt for tensorflow
import tensorflow as tf

def init_weights(shape):
  return tf.Variable(tf.random_normal(shape, stddev = 0.01))

# Dense input: ordinary matrix product.
# Sparse input: X is a (sp_ids, sp_weights) pair; the weighted sum of the
# looked-up rows of w plays the role of X * w.
def matmul(X, w):
  if type(X) == tf.Tensor:
    return tf.matmul(X,w)
  else:
    return tf.nn.embedding_lookup_sparse(w, X[0], X[1], combiner = "sum")

class BinaryClassificationTrainer(object):
  def __init__(self, dataset):
    self.labels = dataset.labels
    self.features = dataset.features
    self.num_features = dataset.num_features
    self.X = tf.placeholder("float", [None, self.num_features])
    self.Y = tf.placeholder("float", [None, 1])

  def gen_feed_dict(self, trX, trY):
    return {self.X: trX, self.Y: trY}

class SparseBinaryClassificationTrainer(object):
  def __init__(self, dataset):
    self.labels = dataset.labels
    self.features = dataset.features
    self.num_features = dataset.num_features
    self.sp_indices = tf.placeholder(tf.int64)
    self.sp_shape = tf.placeholder(tf.int64)
    self.sp_ids_val = tf.placeholder(tf.int64)
    self.sp_weights_val = tf.placeholder(tf.float32)
    # ids and weights share the same indices and shape
    self.sp_ids = tf.SparseTensor(self.sp_indices, self.sp_ids_val, self.sp_shape)
    self.sp_weights = tf.SparseTensor(self.sp_indices, self.sp_weights_val, self.sp_shape)
    self.X = (self.sp_ids, self.sp_weights)
    self.Y = tf.placeholder("float", [None, 1])

  def gen_feed_dict(self, trX, trY):
    return {self.Y: trY, self.sp_indices: trX.sp_indices, self.sp_shape: trX.sp_shape, self.sp_ids_val: trX.sp_ids_val, self.sp_weights_val: trX.sp_weights_val}

def gen_binary_classification_trainer(dataset):
  if type(dataset.features) == Features:
    return BinaryClassificationTrainer(dataset)
  else:
    return SparseBinaryClassificationTrainer(dataset)
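The index bookkeeping in SparseFeatures.mini_batch is easier to follow on a toy example. The stand-alone sketch below (plain Python, with a hypothetical helper name batch_indices) rebuilds the [row, column] pairs and the [rows, max_len] shape for a batch from the start_indices offsets, following the same logic as the class above.

```python
def batch_indices(start_indices, start, end):
    """Hypothetical stand-alone version of the index bookkeeping in mini_batch."""
    sp_indices = []
    max_len = 0
    for row, i in enumerate(range(start, end)):
        # Number of non-zero features of instance i.
        row_len = start_indices[i + 1] - start_indices[i]
        max_len = max(max_len, row_len)
        # Row numbers restart at 0 inside the batch; columns count the
        # position of each non-zero feature within its instance.
        sp_indices.extend([row, col] for col in range(row_len))
    return sp_indices, [end - start, max_len]

# Three instances with 2, 1 and 3 non-zero features.
start_indices = [0, 2, 3, 6]

sp_indices, sp_shape = batch_indices(start_indices, 1, 3)
full_indices, full_shape = batch_indices(start_indices, 0, 3)
print(sp_indices, sp_shape)
```

The column index only records the position inside the instance; the actual feature id travels separately in sp_ids_val, which is why the second dimension of the shape is the longest instance rather than num_features.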

#------------------------- binary_classification.py

#!/usr/bin/env python
#coding=gbk
# ==============================================================================
# \file binary_classification.py
# \author chenghuige
# \date 2015-11-30 16:06:52.693026
# \Description
# ==============================================================================
import sys
import tensorflow as tf
import numpy as np
from sklearn.metrics import roc_auc_score
import melt

flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_float('learning_rate', 0.001, 'Initial learning rate.')
flags.DEFINE_integer('num_epochs', 120, 'Number of epochs to run trainer.')
flags.DEFINE_integer('batch_size', 500, 'Batch size. Must divide evenly into the dataset sizes.')
flags.DEFINE_string('train', './corpus/feature.normed.rand.12000.0_2.txt', 'train file')
flags.DEFINE_string('test', './corpus/feature.normed.rand.12000.1_2.txt', 'test file')
flags.DEFINE_string('method', 'logistic', 'currently support logistic/mlp')
#----for mlp
flags.DEFINE_integer('hidden_size', 20, 'Hidden unit size')

trainset_file = FLAGS.train
testset_file = FLAGS.test
learning_rate = FLAGS.learning_rate
num_epochs = FLAGS.num_epochs
batch_size = FLAGS.batch_size
method = FLAGS.method

trainset = melt.load_dataset(trainset_file)
print "finish loading train set ",trainset_file
testset = melt.load_dataset(testset_file)
print "finish loading test set ", testset_file

assert(trainset.num_features == testset.num_features)
num_features = trainset.num_features
print 'num_features: ', num_features
print 'trainSet size: ', trainset.num_instances()
print 'testSet size: ', testset.num_instances()
print 'batch_size:', batch_size, ' learning_rate:', learning_rate, ' num_epochs:', num_epochs

trainer = melt.gen_binary_classification_trainer(trainset)

class LogisticRegresssion:
  def model(self, X, w):
    return melt.matmul(X,w)

  def run(self, trainer):
    w = melt.init_weights([trainer.num_features, 1])
    py_x = self.model(trainer.X, w)
    return py_x

class Mlp:
  def model(self, X, w_h, w_o):
    h = tf.nn.sigmoid(melt.matmul(X, w_h)) # this is a basic mlp, think 2 stacked logistic regressions
    return tf.matmul(h, w_o) # note that we don't apply the sigmoid here because the cost fn does that for us

  def run(self, trainer):
    w_h = melt.init_weights([trainer.num_features, FLAGS.hidden_size]) # create symbolic variables
    w_o = melt.init_weights([FLAGS.hidden_size, 1])
    py_x = self.model(trainer.X, w_h, w_o)
    return py_x

def gen_algo(method):
  if method == 'logistic':
    return LogisticRegresssion()
  elif method == 'mlp':
    return Mlp()
  else:
    print method, ' is not supported right now'
    exit(-1)

algo = gen_algo(method)

py_x = algo.run(trainer)
Y = trainer.Y

cost = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(py_x, Y))
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost) # construct optimizer
predict_op = tf.nn.sigmoid(py_x)

sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)

teX, teY = testset.full_batch()
num_train_instances = trainset.num_instances()
for i in range(num_epochs):
  # evaluate on the full test set before each epoch of training
  predicts, cost_ = sess.run([predict_op, cost], feed_dict = trainer.gen_feed_dict(teX, teY))
  print i, 'auc:', roc_auc_score(teY, predicts), 'cost:', cost_ / len(teY)
  for start, end in zip(range(0, num_train_instances, batch_size), range(batch_size, num_train_instances, batch_size)):
    trX, trY = trainset.mini_batch(start, end)
    sess.run(train_op, feed_dict = trainer.gen_feed_dict(trX, trY))

predicts, cost_ = sess.run([predict_op, cost], feed_dict = trainer.gen_feed_dict(teX, teY))
print 'final ', 'auc:', roc_auc_score(teY, predicts),'cost:', cost_ / len(teY)
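One detail of the training loop worth knowing: pairing the two ranges with zip stops at the last full batch, so a trailing partial batch is never trained on. The sketch below (a hypothetical helper, batch_ranges) lists the (start, end) pairs for the sizes reported in the log above, alongside a variant that keeps the tail. The variant is only a suggestion, not what the posted code does.

```python
def batch_ranges(num_instances, batch_size, keep_tail=False):
    """Hypothetical helper: list the (start, end) pairs of each mini-batch."""
    starts = range(0, num_instances, batch_size)
    if keep_tail:
        # Let the last batch be shorter instead of dropping it.
        return [(s, min(s + batch_size, num_instances)) for s in starts]
    # Same pairing as the training loop in the post.
    return list(zip(starts, range(batch_size, num_instances, batch_size)))

# Sizes taken from the run above: 70968 training instances, batch size 200.
as_posted = batch_ranges(70968, 200)
with_tail = batch_ranges(70968, 200, keep_tail=True)
skipped = 70968 - as_posted[-1][1]

print(len(as_posted), as_posted[-1], skipped)
print(len(with_tail), with_tail[-1])
```

With these sizes the posted loop runs 354 batches per epoch and leaves the last 168 instances unused, which is a small fraction of the training set here but could matter more with a larger batch size.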