Naive Bayes Classification (classify-20newsgroups)
I Theoretical Analysis
The analysis below follows the paper: Rennie, J. D. M., Shih, L., Teevan, J., and Karger, D. R. (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. ICML 2003.
The multinomial Naive Bayes model is given by the formula below. A document d is represented as a bag of words; with $f_i$ denoting how many times word i occurs in d, the probability of document d conditioned on class c is

$$p(d \mid \vec{\theta}_c) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!} \prod_i \theta_{ci}^{\,f_i}.$$

Here $\vec{\theta}_c = (\theta_{c1}, \ldots, \theta_{cn})$ is the parameter vector of class c; there are m classes in total, and each class vector consists of the probability parameters of the n vocabulary words, so $\theta_{ci}$ is the probability of word i in class c.
To score document d, this likelihood is usually combined with the prior probability of class c, giving the posterior $p(c \mid d) \propto p(c)\, p(d \mid \vec{\theta}_c)$; in practice, however, the priors are often identical across classes.
The key is estimating $\theta_{ci}$. Following Heckerman, D. (1995), A Tutorial on Learning with Bayesian Networks, the smoothed estimate of the probability of word i in class c is

$$\hat{\theta}_{ci} = \frac{N_{ci} + \alpha_i}{N_c + \alpha},$$

where $N_{ci}$ is the number of occurrences of word i in the training documents of class c, $N_c = \sum_i N_{ci}$, and $\alpha_i$ (with $\alpha = \sum_i \alpha_i$) is a smoothing prior.
The classification function is then

$$l_{MNB}(d) = \arg\max_c \left[ \log p(c) + \sum_i f_i \log \frac{N_{ci} + \alpha_i}{N_c + \alpha} \right].$$
Mahout's Naive Bayes assumes equal prior probabilities for all classes by default, so the class c that maximizes this score is the predicted class; this is the logic of StandardNaiveBayesClassifier.java.
For ComplementaryNaiveBayesClassifier.java the formula becomes

$$l_{CNB}(d) = \arg\max_c \left[ \log p(c) - \sum_i f_i \log \frac{N_{\tilde{c}i} + \alpha_i}{N_{\tilde{c}} + \alpha} \right],$$

where $N_{\tilde{c}i}$ is the number of occurrences of word i in documents that do not belong to class c, and $N_{\tilde{c}} = \sum_i N_{\tilde{c}i}$. The larger this complement weight, the less likely the document belongs to c, hence the minus sign, the opposite of the StandardNaiveBayesClassifier above.
In general $N_{\tilde{c}}$ is much larger than $N_c$, i.e. the complement sample is larger and its size is more even across classes, so the complement estimate of the likelihood is more accurate.
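To make the difference concrete, here is a small worked example with invented counts (illustrative only, not from 20newsgroups): take $N_{ci} = 10$, $N_c = 100$, $N_{\tilde{c}i} = 40$, $N_{\tilde{c}} = 900$, a vocabulary of $n = 1000$ words, and uniform smoothing $\alpha_i = 1$, so $\alpha = 1000$:

\begin{align*}
w^{MNB}_{ci} &= \log\frac{N_{ci}+\alpha_i}{N_c+\alpha} = \log\frac{10+1}{100+1000} = \log 0.01 \approx -4.61,\\
w^{CNB}_{ci} &= -\log\frac{N_{\tilde{c}i}+\alpha_i}{N_{\tilde{c}}+\alpha} = -\log\frac{40+1}{900+1000} \approx 3.84.
\end{align*}

A word that is relatively frequent in class c raises the score under both rules, but the complement weight is estimated from 900 rather than 100 word occurrences, so it suffers less from sampling noise.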
II Code Analysis
1 Create sequence files from the 20newsgroups files
./bin/mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq -ow
The input is the files under each category directory; the output is a sequence file whose key is the file name and whose value is the file content.
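As a quick sanity check, the resulting sequence file can be dumped with Hadoop's SequenceFile.Reader. This is a minimal sketch under the assumption that both key and value are Text, as seqdirectory produces; the class name and the chunk file name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpSeq {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // e.g. ${WORK_DIR}/20news-seq/chunk-0
    Path path = new Path(args[0]);
    try (SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
      Text key = new Text();    // the file name of the original document
      Text value = new Text();  // the raw file content
      while (reader.next(key, value)) {
        String v = value.toString();
        System.out.println(key + " -> " + v.substring(0, Math.min(40, v.length())));
      }
    }
  }
}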
2 Create vectors from the sequence files
./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
Parameters: -lnorm log-normalizes the output vectors; -nv writes them as NamedVectors; -wt tfidf selects the TF-IDF weighting model, see http://zh.wikipedia.org/zh/TF-IDF
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
This tokenizes each document; the output has the form {doc1, [term1, …], …}.
if (processIdf) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf,
      minSupport, maxNGramSize, minLLRValue, -1.0f, false, reduceTasks, chunkSize,
      sequentialAccessOutput, namedVectors);
}
This function first builds the dictionary of all words used by the documents, i.e. each word's total usage count; the output has the form {word1, num1, …}.
It then counts the word occurrences of each document (see the toy example below), producing {doc1, [Num-term1, …], …}, where term is the word's index in the dictionary.
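A toy illustration of these intermediate shapes (the documents and counts are invented):

// Tokenized documents:
//   {doc1, [atheism, is, a, belief]}
//   {doc2, [graphics, is, a, topic]}
// Dictionary with usage counts: {a: 2, atheism: 1, belief: 1, graphics: 1, is: 2, topic: 1}
// Dictionary indices:           a=0, atheism=1, belief=2, graphics=3, is=4, topic=5
// Term-frequency vectors:
//   {doc1, [0:1, 1:1, 2:1, 4:1]}
//   {doc2, [0:1, 3:1, 4:1, 5:1]}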
docFrequenciesFeatures =
    TFIDFConverter.calculateDF(new Path(outputDir, tfDirName), outputDir, conf, chunkSize);
This computes, for every word, the number of documents it appears in (its document frequency); the output has the form {-1, docs-num, term1, num1, …}, where the key -1 carries the total document count.
TFIDFConverter.processTfIdf(
    new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), outputDir, conf,
    docFrequenciesFeatures, minDf, maxDF, norm, logNormalize,
    sequentialAccessOutput, namedVectors, reduceTasks);
This computes the TF-IDF value of every term in every document; the output has the form {doc1, [Num-term-idf1, …], …}.
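For reference, a minimal sketch of the textbook TF-IDF weighting this pipeline implements; Mahout's actual implementation is Lucene-based and may differ in smoothing details, and the method name here is illustrative:

// weight(term, doc) = tf(term, doc) * log(numDocs / df(term))
static double tfIdf(int termFreqInDoc, int docFreq, int numDocs) {
  double tf = termFreqInDoc;                          // occurrences in this document
  double idf = Math.log((double) numDocs / docFreq);  // rare terms weigh more
  return tf * idf;
}

// Example: a term occurring 3 times in a document and present in 10 of 1000 documents:
// tfIdf(3, 10, 1000) = 3 * ln(100) ≈ 13.8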
3 Split the training and test samples
./bin/mahout split \
-i ${WORK_DIR}/20news-vectors/tfidf-vectors \
--trainingOutput ${WORK_DIR}/20news-train-vectors \
--testOutput ${WORK_DIR}/20news-test-vectors \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
--randomSelectionPct 40 randomly holds out 40% of the vectors as the test set, leaving 60% for training; -xm sequential runs the split locally rather than as a MapReduce job.
4 Train the Naive Bayes model
./bin/mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow $c
long labelSize = createLabelIndex(labPath);
This collects the class labels (categories) present in the training set, writes the label index mapping each label to an integer, and returns the number of labels.
Job indexInstances = prepareJob(getInputPath(), getTempPath(SUMMED_OBSERVATIONS),
    SequenceFileInputFormat.class, IndexInstancesMapper.class, IntWritable.class,
    VectorWritable.class, VectorSumReducer.class, IntWritable.class, VectorWritable.class,
    SequenceFileOutputFormat.class);
indexInstances.setCombinerClass(VectorSumReducer.class);
boolean succeeded = indexInstances.waitForCompletion(true);
if (!succeeded) {
  return -1;
}
This sums the term vectors of all documents of each class, converting the category name into its integer index from the label index; a minimal sketch of the same aggregation follows below.
The corresponding output has the form {class1, [Num-term1, …], …}.
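Conceptually, IndexInstancesMapper emits (label index, document vector) pairs and VectorSumReducer adds them up. Here is a minimal in-memory sketch of the same aggregation using Mahout's math API; the method and variable names are illustrative:

import java.util.HashMap;
import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

// docVectors: document id -> tf-idf vector; docLabels: document id -> label index
static Map<Integer, Vector> sumPerLabel(Map<Integer, Vector> docVectors,
                                        Map<Integer, Integer> docLabels,
                                        int numFeatures) {
  Map<Integer, Vector> sums = new HashMap<>();
  for (Map.Entry<Integer, Vector> e : docVectors.entrySet()) {
    int label = docLabels.get(e.getKey());
    // accumulate this document's vector into its class total,
    // mirroring IndexInstancesMapper + VectorSumReducer
    sums.computeIfAbsent(label, k -> new RandomAccessSparseVector(numFeatures))
        .assign(e.getValue(), Functions.PLUS);
  }
  return sums;
}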
Job weightSummer = prepareJob(getTempPath(SUMMED_OBSERVATIONS), getTempPath(WEIGHTS),
    SequenceFileInputFormat.class, WeightsMapper.class, Text.class, VectorWritable.class,
    VectorSumReducer.class, Text.class, VectorWritable.class, SequenceFileOutputFormat.class);
weightSummer.getConfiguration().set(WeightsMapper.NUM_LABELS, String.valueOf(labelSize));
weightSummer.setCombinerClass(VectorSumReducer.class);
This computes the weight of each class and the weight of each feature (term); the output has the form {WEIGHTS_PER_FEATURE, [feature1, num1, …], WEIGHTS_PER_LABEL, [label1, num1, …]}.
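In the notation of the theory section these aggregates are exactly the counts the estimator needs (this correspondence is my reading of the job outputs):

\begin{align*}
\text{WEIGHTS\_PER\_LABEL}[c] &= N_c = \sum_i N_{ci} \quad \text{(total term weight in class } c\text{)},\\
\text{WEIGHTS\_PER\_FEATURE}[i] &= \sum_c N_{ci} \quad \text{(total weight of term } i \text{ across all classes)},
\end{align*}

and the complement counts for CNB follow as $N_{\tilde{c}i} = \text{WEIGHTS\_PER\_FEATURE}[i] - N_{ci}$.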
5 Test the Naive Bayes model
./bin/mahout testnb \
-i ${WORK_DIR}/20news-test-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c
public Vector classifyFull(Vector r, Vector instance) {
  // fill r with one score per label
  for (int label = 0; label < model.numLabels(); label++) {
    r.setQuick(label, getScoreForLabelInstance(label, instance));
  }
  return r;
}

protected double getScoreForLabelInstance(int label, Vector instance) {
  double result = 0.0;
  // sum the weighted score of every term that occurs in the document
  for (Element e : instance.nonZeroes()) {
    result += e.get() * getScoreForLabelFeature(label, e.index());
  }
  return result;
}

@Override
public double getScoreForLabelFeature(int label, int feature) {
  NaiveBayesModel model = getModel();
  return computeWeight(model.weight(label, feature), model.labelWeight(label),
      model.alphaI(), model.numFeatures());
}

public static double computeWeight(double featureLabelWeight, double labelWeight,
    double alphaI, double numFeatures) {
  // log((N_ci + alpha_i) / (N_c + alpha)), with uniform alpha_i = alphaI
  double numerator = featureLabelWeight + alphaI;
  double denominator = labelWeight + alphaI * numFeatures;
  return Math.log(numerator / denominator);
}
Following the formulas from the theory section, this computes each document's likelihood score for every class: computeWeight is exactly $\log\frac{N_{ci}+\alpha_i}{N_c+\alpha}$ with a uniform smoothing parameter (so $\alpha = \alpha_I \cdot n$ for n features), and getScoreForLabelInstance multiplies each weight by the document's term value $f_i$ and sums over the non-zero terms.
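A quick numerical check against the worked example from the theory section, assuming the static computeWeight shown above lives on StandardNaiveBayesClassifier as the snippet suggests:

// N_ci = 10, N_c = 100, alpha_I = 1, numFeatures = 1000
// expected: log(11 / 1100) = log(0.01) ≈ -4.61
double w = StandardNaiveBayesClassifier.computeWeight(10.0, 100.0, 1.0, 1000.0);
System.out.println(w);  // prints approximately -4.605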