Topic: A Word Segmentation Program Based on an Object-Oriented Model

Posted by admin on 2008/4/11 5:59:49

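This is only a preliminary framework. Feel free to keep improving it, for example by adding a function that computes vector space similarity.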

 

public class VSM {
        public static void main(String[] args) {
                // get documents
                String[] docs = {
                                "The search trees overcome many issues of hash dictionary",
                                "There are many different implementations of the Java Platform running at a variety of operating systems.",
                                "Applet is a Java class that can be Embedded within an HTML page and downloaded and executed by a Web browser",
                                "Java programming language defines eight primitive types",
                                "The traditional crawlers used by search engines to build their collection of Web pages frequently gather unmodified pages that already exist in their collection",
                                "The activities of many users on an Information Retrieval System(IRS) are often very similar because they have similar preferences or related interest",
                                "Research in Information Retrieval can be categorized along multiple dimensions, focusing, for example,on the technical paradigm, the research field, the targeted document type, or the application domain." };
                Collection allDocs=new Collection(docs);
                allDocs.process();
                allDocs.printDocuments();
        }
}

 

class Collection
{
        private Document[] documents;
        public static String[] stopList = { "an", "and", "are", "as", "at", "be", "by",
                        "for", "from", "has", "he", "in", "is", "it", "its", "of",
                        "on", "that", "the", "to", "was", "were", "will", "with" };
       
        public Collection()
        {
               
        }
       
        public Collection(String [] docs)
        {
                setDocuments(docs);                    
        }
       
        public void process()
        {
                java.util.Arrays.sort(stopList);
                for (int i = 0; i < documents.length; i++)
                {
                        documents[i].computeTerms();
                }
                computeIDF();
        }

 

        public Document[] getDocuments() {
                return documents;
        }

 

        public void setDocuments(Document[] docs) {
                this.documents = docs;
        }
       
        public void setDocuments(String[] docs) {
                documents=new Document[docs.length];
                for(int i=0;i<documents.length;i++)
                {
                        documents[i]=new Document();
                        documents[i].setContent(docs[i]);
                }      
        }
       
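        // Computes the document frequency and IDF of every term in every document.
        // The posted loop counted matches but never stored the result; storing the
        // count in frequecy and using idf = ln(N / df) are assumptions made here.
        public void computeIDF()
        {
                for(int i=0;i<documents.length;i++)
                {
                        for(int j=0;j<documents[i].getTerms().size();j++)
                        {
                                Term t=(Term)(documents[i].getTerms().get(j));
                                int count=0; // number of documents that contain t
                                for(int k=0;k<documents.length;k++)
                                {
                                        if(documents[k].findTerm(t))
                                                count++;
                                }
                                // count >= 1 because documents[i] itself contains t
                                t.setFrequecy(count);
                                t.setIdf(Math.log((double)documents.length/count));
                        }
                }
        }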
       
        public void printDocuments()
        {
                for(int i=0;i<documents.length;i++)
                {
                        documents[i].printDocument();
                }
        }
}

 

class Document
{
        private java.util.ArrayList terms=new  java.util.ArrayList();
        private String content;
       
        public String getContent() {
                return content;
        }
        public void setContent(String content) {
                this.content=content;
        }      
       
        public boolean findTerm(Term t)
        {      
                if(terms.indexOf(t)<0)
                        return false;
                else
                        return true;
        }
       
        public void computeTerms()
        {
                String[] tokens = content.toLowerCase().split("\\W");
                for (int i = 0; i < tokens.length; i++) {
                        // skip empty tokens, single characters, and stop words
                        if (!(tokens[i].equals("") || tokens[i].length() == 1))
                                if (java.util.Arrays.binarySearch(Collection.stopList, tokens[i]) < 0)
                                {
                                        Term t=new Term();
                                        t.setTerm(tokens[i]);
                                       
                                        int index=terms.indexOf(t);
                                        if(index<0)
                                                terms.add(t);
                                        else
                                        {
                                                Term tone=(Term)(terms.get(index));
                                                tone.setTf(tone.getTf()+1);
                                        }                              
                                }      
                }
               
        }
       
        public void printDocument()
        {
                for (int j = 0; j < terms.size(); j++)
                {
                        Term t=(Term)terms.get(j);
                        System.out.print(t.getTerm()+"("+ t.getTf() +")" + "\t");                      
                }
                System.out.println();          
        }
        public java.util.ArrayList getTerms() {
                return terms;
        }
        public void setTerms(java.util.ArrayList terms) {
                this.terms = terms;
        }      
}

 

class Term
{
        private String term;
        private int frequecy;
        private int tf=1;
        private double idf;
       
        public int getFrequecy() {
                return frequecy;               
        }
        public void setFrequecy(int frequecy) {
                this.frequecy = frequecy;
        }
        public String getTerm() {
                return term;
        }
        public void setTerm(String term) {
                this.term = term;
        }
        public double getIdf() {
                return idf;
        }
        public void setIdf(double idf) {
                this.idf = idf;
        }
        public int getTf() {
                return tf;
        }
        public void setTf(int tf) {
                this.tf = tf;
        }
       
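        // Two terms are equal when their strings match; tf and idf are ignored.
        public boolean equals(Object t)
        {
                if(!(t instanceof Term))
                        return false;
                return this.term.equals(((Term)t).getTerm());
        }
        // Keep hashCode consistent with equals.
        public int hashCode()
        {
                return term.hashCode();
        }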
        
}
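
The post suggests adding a vector space similarity function as a next step. Below is a minimal sketch of one way to do that, using cosine similarity between two sparse term-weight vectors. It is not part of the original program: the names CosineSketch, termWeights, and cosine are hypothetical, the weights here are raw term frequencies (tf multiplied by idf could be used instead), and stop-word removal is left out to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the posted program.
class CosineSketch {

    // Builds a sparse vector (term -> raw term frequency) from a text.
    // Assumption: same tokenization idea as the post (lowercase, split on
    // non-word characters, drop empty and single-character tokens).
    static Map<String, Double> termWeights(String text) {
        Map<String, Double> weights = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.length() > 1) {
                weights.merge(token, 1.0, Double::sum);
            }
        }
        return weights;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Returns 0.0 when either vector is empty or has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = termWeights("Java programming language defines eight primitive types");
        Map<String, Double> d2 = termWeights("Applet is a Java class that can be embedded within an HTML page");
        System.out.println("similarity = " + cosine(d1, d2));
    }
}
```

To plug this into the program above, one option would be to give Document a method that returns such a map built from its Term list (using tf times idf as the weight) and then call cosine on any two documents.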

[This post was edited by the author on 2010-12-14 09:18:25]
