CS267


Semester Fall 2010



Knowledge Based Search Engine


Project Report and Description






Submitted By:

Manandeep Singh Bedi

Class Id: 306

SJSU ID: 006491940




Project Outline

The 20_Newsgroup data was converted to text format and then tokenized to get the desired result.

Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:

Input: Friends, Romans, Countrymen, lend me your ears;

Output: FRIENDS, ROMANS, COUNTRYMEN, LEND, ME, YOUR, EARS


These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system's dictionary. The set of index terms could be entirely distinct from the tokens; for instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related to the tokens in the document. However, rather than being exactly the tokens that appear in the document, they are usually derived from them by various normalization processes.

For example, if the document to be indexed is "to sleep perchance to dream", then there are 5 tokens, but only 4 types (since there are 2 instances of "to"). However, if "to" is omitted from the index, then there will be only 3 terms: sleep, perchance, and dream.

Tokenization also comes from computer science, where it refers to simplifying a data set by replacing complex data structures (such as words) with simpler ones (such as numbers). In short, you create a cipher over a set of words or expressions, where the words or expressions are represented by numbers. I have seen people discuss tokenization in two contexts with respect to search engines: pages or documents can be tokenized (assigned a unique identifier, an ID number), and words within documents can be tokenized (also assigned a unique identifier, but one which is used in a different context from the page identifier). A small sketch of the word-level case follows.
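To make the word-level case concrete, here is a minimal sketch of assigning unique IDs to words, using the "to sleep perchance to dream" example from above; the container choice and output format are illustrative assumptions, not part of the project code.

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> words = {"to", "sleep", "perchance", "to", "dream"};
    std::map<std::string, int> id; // word -> unique identifier
    for (const std::string& w : words) {
        if (id.count(w) == 0) {
            int next = static_cast<int>(id.size()); // next unused ID
            id[w] = next;
        }
    }
    for (const auto& kv : id)
        std::cout << kv.first << " -> " << kv.second << "\n";
    // "to" occurs twice but receives one ID: 5 tokens, 4 types
    return 0;
}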

The approach we have implemented here is:

1. Read files from secondary storage, one at a time, from the path directory.

2. Read char by char from each file; when a delimiter is encountered, a token is generated.

3. The token is then put into the output file named "tokens" in the project (a minimal sketch of this loop follows).
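The following is a minimal sketch of steps 1-3 for a single file, assuming whitespace and punctuation as the delimiter set and illustrative file names; the actual project code appears in the Code section below.

#include <cctype>
#include <fstream>
#include <string>

int main()
{
    std::ifstream in("input.txt"); // one file from the converted data set (name is illustrative)
    std::ofstream out("tokens");   // output file named "tokens", as in step 3
    std::string token;
    char c;
    while (in.get(c)) { // step 2: read char by char
        // treat whitespace and punctuation as delimiters (an assumption; the report
        // does not spell out the exact delimiter set)
        if (std::isspace(static_cast<unsigned char>(c)) || std::ispunct(static_cast<unsigned char>(c))) {
            if (!token.empty()) {
                out << token << "\n"; // delimiter reached: emit the accumulated token
                token.clear();
            }
        } else {
            token += c;
        }
    }
    if (!token.empty()) out << token << "\n"; // flush the final token
    return 0;
}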


TF/IDF:

After tokenizing, term frequency and inverse document frequency were calculated.

Term frequency is a measure of how often a term is found in a collection of documents. TF is combined with inverse document frequency (IDF) as a means of determining which documents are most relevant to a query. TF is sometimes also used to measure how often a word appears in a specific document.

The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow." A simple way to start out is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents, while terms like "brown" and "cow" that occur rarely are good keywords to distinguish relevant documents from the non-relevant ones. Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.


Mathematical details

The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of the importance of the term $t_i$ within the particular document $d_j$. Thus we have the term frequency, defined as follows:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the considered term ($t_i$) in document $d_j$, and the denominator is the sum of the number of occurrences of all terms in document $d_j$.

The inverse document frequency is a measure of the general importance of the term (obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

with

$|D|$: total number of documents in the corpus

$|\{d : t_i \in d\}|$: number of documents where the term $t_i$ appears (that is, $n_{i,j} \neq 0$). If the term is not in the corpus, this will lead to a division by zero. It is therefore common to use $1 + |\{d : t_i \in d\}|$.

Then

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$
A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf-idf value for a term will always be greater than or equal to zero.
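As a quick worked example (the numbers are illustrative, not taken from the project data, and the natural logarithm is used to match the `log()` call in the code below): consider a document of 100 words in which "cow" appears 3 times, in a corpus of 1,000 documents of which 10 contain "cow". Then

$$\mathrm{tf} = \frac{3}{100} = 0.03, \qquad \mathrm{idf} = \ln \frac{1000}{10} \approx 4.61, \qquad \mathrm{tfidf} \approx 0.03 \times 4.61 \approx 0.138$$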

Stopwords:

Stopwords are common words that carry less important meaning than keywords. Usually search engines remove stopwords from a keyword phrase to return the most relevant result, i.e. stopwords drive much less traffic than keywords.

Stopwords are a part of human language and there's nothing you can do about it. Still, high stopword density can make your content look less important to search engines.

All stop words, for example common words such as "a" and "the", are removed from multiple-word queries to increase search performance. Stop word recognition in Japanese is based on grammatical information; for example, IBM® Cognos® Content Analytics recognizes whether the word is a noun or a verb. For the other languages, the system uses special lists.

No stop words are removed during query processing if:

- All of the words in a query are stop words. If all the query terms are removed during stop word processing, then the result set is empty. To ensure that search results are returned, stop word removal is disabled when all of the query terms are stop words. For example, if the word "car" is a stop word and you search for "car", then the search results contain documents that match the word "car". If you search for "car buick", the search results contain only documents that match the word "buick".

- The word in a query is preceded by the plus sign (+).

- The word is part of an exact match.

- The word is inside a phrase, for example, "I love my car".


[Figure: the two paragraphs above shown with their stopwords removed.]

The filtered text is shorter than the original: 31 words versus 66. Approximately 50 percent of the words are stopwords, i.e. half of the text is not really important for search engines.

Stop word Elimination Script:

while (f >> str)
{
    // reading the file and inserting words into the token vector
    tokens.insert(tokens.end(), str);
}
f.close();

for (it = tokens.begin(); it < tokens.end(); it++)
{
    p++;
    pos++;
    string word = *it;

    // skip tokens that begin with a digit
    if (word[0] >= '0' && word[0] <= '9')
        { continue; }

    // compare the token against every entry in the stopword list
    int no_stopwords = sizeof(stopwords) / sizeof(stopwords[0]);
    for (int q = 0; q < no_stopwords; q++)
    {
        if (stopwords[q] == *it) {
            goto lab; // stopword found: skip this token
        }
    }
}

Word Position Representation in Document:

The positional representation is a simple extension of a classic bag-of-words representation, which stores not only information about word occurrence frequency, but also information about the relative positions of words in a document. A positional representation of a document D = (w1, w2, ..., wn) is a pair (F, S), where F is a set of word density functions f_vi whose domain is the set {1..n}, and S is a scaling vector with the same values as in unigram representations.

Building a positional representation for a document is not a complex task: it involves sliding a window of length 2r over the document's contents, counting occurrences of the words contained within the window, and normalizing the count values afterwards (see the sketch below).
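A minimal sketch of the window-counting step, under the assumption that the window of radius r is centered on each position and counts are normalized by the window length 2r; the function name and the exact normalization are illustrative, not taken from the project code.

#include <algorithm>
#include <string>
#include <vector>

// Density of `word` inside a window of radius r centered at `center`,
// normalized by the window length 2r.
double windowDensity(const std::vector<std::string>& doc, const std::string& word,
                     std::size_t center, std::size_t r)
{
    std::size_t lo = center > r ? center - r : 0;      // clamp window to document start
    std::size_t hi = std::min(doc.size(), center + r + 1); // clamp window to document end
    int count = 0;
    for (std::size_t k = lo; k < hi; k++) {
        if (doc[k] == word) count++;
    }
    return static_cast<double>(count) / static_cast<double>(2 * r);
}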

Search Criteria

The search criterion we use, an optimal approach for developing a knowledge-based search engine, is known as proximity search. A proximity search looks for documents where two or more term occurrences matching the pattern are separated by at most a specified distance, where distance is the number of intermediate words or characters. In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. A minimal sketch of the basic check follows.
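This sketch shows an unordered proximity check, assuming distance is measured in intermediate words; it illustrates the idea rather than reproducing the project's exact implementation.

#include <cstdlib>
#include <string>
#include <vector>

// True if t1 and t2 occur somewhere in `doc` separated by at most `maxDist`
// intermediate words (unordered). maxDist = 30 would mirror this project's limit.
bool withinProximity(const std::vector<std::string>& doc,
                     const std::string& t1, const std::string& t2, long maxDist)
{
    long last1 = -1, last2 = -1; // most recent positions of each term (-1 = unseen)
    for (std::size_t i = 0; i < doc.size(); i++) {
        if (doc[i] == t1) last1 = static_cast<long>(i);
        if (doc[i] == t2) last2 = static_cast<long>(i);
        if (last1 >= 0 && last2 >= 0 && std::labs(last1 - last2) <= maxDist)
            return true;
    }
    return false;
}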

Two-keyword Searching:

After calculating the term frequency and inverse document frequency of the given data, we set the values in such a way as to get just the TF/IDF values in a document. After getting the TF/IDF value, we set the limit of repeating words in a document to 30, which means that for pairing of words we selected the limit 30: if a word repeats within the next 30 words, it is selected to form a pair with the other desired word and fulfill the two-keyword search scenario. For example, if the first word is "MANAN" and the other word is "DEEP", and DEEP appears within the next 30 words after MANAN, the pair "MANAN DEEP" is selected and copied to another document. Here two words are combined to form 2 keywords, which together are considered the complete keyword. A sketch of this pairing rule appears below.
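A minimal sketch of the pairing rule described above, with an illustrative output file name; the project's actual implementation is the TRY.cpp program in the Code section.

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

// If `second` occurs within the next 30 words after an occurrence of `first`,
// write the combined keyword (e.g. "MANAN DEEP") to an output file.
void emitPairs(const std::vector<std::string>& words,
               const std::string& first, const std::string& second)
{
    std::ofstream out("pairs.txt", std::ios::app); // output file name is illustrative
    for (std::size_t i = 0; i < words.size(); i++) {
        if (words[i] != first) continue;
        std::size_t end = std::min(words.size(), i + 31); // scan the next 30 words
        for (std::size_t j = i + 1; j < end; j++) {
            if (words[j] == second) {
                out << first << " " << second << "\n"; // copy the pair out
                break;
            }
        }
    }
}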

Code:

TF/IDF code:

// This code reads each of the files from the directory data set provided, tokenises each file, and computes the TF (term frequency).
// The tokens are then used to calculate the IDF (inverse document frequency), which counts the number of documents containing each token. The results are recorded in the files in the "modified" folder.

#include <sys/types.h>
#include <dirent.h>
#include <errno.h>
#include <vector>
#include <string.h>
#include <iostream>
#include <fstream>
#include <math.h>
#include <conio.h>

using namespace std;

int getdir(string dir, vector<string> &files) // opening the reuters directory to get access to files
{
    DIR *dp;
    struct dirent *dirp;
    if ((dp = opendir(dir.c_str())) == NULL) {
        cout << "Error(" << errno << ") opening " << dir << endl;
        return errno;
    }
    while ((dirp = readdir(dp)) != NULL) {
        files.push_back(string(dirp->d_name));
    }
    closedir(dp);
    return 0;
}

int main()
{
    string dir = string("D:\\reuters_1000\\reuters_1000\\reuters_1000");
    vector<string> files = vector<string>();
    vector<string> tokens = vector<string>();
    vector<string>::iterator it;
    char str[1024];

    getdir(dir, files);

    for (unsigned int i = 0; i < files.size(); i++) {
        int wordcount = 1;
        vector<string>::iterator it1;

        string dir = string("D:\\reuters_1000\\reuters_1000\\reuters_1000");
        string dircpy = string("D:\\reuters_1000\\reuters_1000\\reuters_1000");
        string dir1 = string("D:\\reuters_1000\\reuters_1000\\modified");
        string dir2 = string("D:\\reuters_1000\\reuters_1000\\final\\final");

        string a = files[i]; // current file name
        string b = files[i];
        dir.append(a);  // full path of the file being processed
        dir1.append(a); // per-file tf-idf listing
        dir2.append(a); // per-file filtered keyword output

        ifstream f;
        f.open(dir.c_str());

        // reading the file and inserting words into the token vector
        while (f >> str) {
            tokens.insert(tokens.end(), str);
            wordcount++;
        }
        f.close();

        for (it = tokens.begin(); it < tokens.end(); it++) {
            int doccount = 1;

            // check all other files for the word to calculate idf
            for (unsigned int s = 0; s < files.size(); s++) {
                string c = files[s];
                if (c == b) { continue; }
                string path = dircpy;
                path.append(c);
                ifstream g;
                g.open(path.c_str());
                while (g >> str) { // scan the whole file for the word
                    if (*it == str) {
                        doccount++;
                        break;
                    }
                }
            }

            // calculating tf-idf
            long double tf, idf, tfidf;
            int count = 1;
            // counting the frequency of the word in the current file
            for (it1 = it + 1; it1 < tokens.end(); it1++) {
                if (*it1 == *it) { count++; }
            }

            tf = float(count) / float(wordcount);
            idf = float(log(1000) / float(doccount)); // corpus size of 1000 is hardcoded
            tfidf = float(float(tf) * float(idf));

            ofstream op, ofs;
            op.open(dir1.c_str(), ios_base::app);
            ofs.open(dir2.c_str(), ios_base::app);
            op << *it << "\t" << tfidf << "\n";
            // keep only tokens whose tf-idf value is below the hardcoded threshold
            if (tfidf < 0.15) {
                ofs << *it << "\n";
            }
            op.close();
            ofs.close();
        }

        tokens.clear();
        getch();
    } // total files reading end
} // main end

Code for TRY.cpp:

// THIS PROGRAM TAKES IN FILTERED KEYWORD INPUT AND FURTHER FILTERS STOPWORDS, SPECIAL CHARACTERS, AND NUMBERS.
// FINAL OUTPUT IS 4 KEYWORDS FOUND IN A FILE BASED ON POSITION. THE POSITION SELECTED IS 30 BETWEEN EACH SET OF TWO CONSECUTIVE KEYWORDS.
// The position is hardcoded in the program.

#include <sys/types.h>
#include <dirent.h>
#include <errno.h>
#include <vector>
#include <string.h>
#include <iostream>
#include <fstream>
#include "common.h"

static int w = 0; // running count of keywords written, grouped in fours

using namespace std;

// hardcoded stopword list
char stopwords[][1024] = {
    "a", "able", "about", "above", "abroad", "according", "accordingly", "across", "actually", "adj", "after",
    "afterwards", "again", "against", "ago", "ahead", "ain't", "all", "allow", "allows", "almost", "alone",
    "along", "alongside", "already", "also", "although", "always", "am", "amid", "amidst", "among", "amongst",
    "an", "and", "another", "any", "anybody", "anyhow", "anyone", "anything", "anyway", "anyways", "anywhere",
    "apart", "appear", "appreciate", "appropriate", "are", "aren't", "around", "as", "a's", "aside", "ask",
    "asking", "associated", "at", "available", "away", "awfully", "b", "back", "backward", "backwards",
    "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin",
    "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "both",
    "brief", "but", "by", "c", "came", "can", "cannot", "cant", "can't", "caption", "cause", "causes",
    "certain", "certainly", "changes", "clearly", "c'mon", "co", "co.", "Co", "Ltd", "com", "come", "comes", "concerning",
    "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could",
    "couldn't", "course", "c's", "currently", "d", "dare", "daren't", "definitely", "described", "despite",
    "did", "didn't", "different", "directly", "do", "does", "doesn't", "doing", "done", "don't", "down",
    "downwards", "during", "e", "each", "edu", "eg", "eight", "eighty", "either", "else", "elsewhere",
    "end", "ending", "enough", "entirely", "especially", "et", "etc", "even", "ever", "evermore", "every",
    "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "f", "fairly",
    "far", "farther", "few", "fewer", "fifth", "first", "five", "followed", "following", "follows", "for",
    "forever", "former", "formerly", "forth", "forward", "found", "four", "from", "further", "furthermore",
    "g", "get", "gets", "getting", "given", "gives", "go", "goes", "going", "gone", "got", "gotten",
    "greetings", "h", "had", "hadn't", "half", "happens", "hardly", "has", "hasn't", "have", "haven't",
    "having", "he", "he'd", "he'll", "hello", "help", "hence", "her", "here", "hereafter", "hereby",
    "herein", "here's", "hereupon", "hers", "herself", "he's", "hi", "him", "himself", "his", "hither",
    "hopefully", "how", "howbeit", "however", "hundred", "i", "i'd", "ie", "if", "ignored", "i'll", "i'm",
    "immediate", "in", "inasmuch", "inc", "inc.", "indeed", "indicate", "indicated", "indicates", "inner",
    "inside", "insofar", "instead", "into", "Inc", "inward", "is", "isn't", "it", "it'd", "it'll", "its", "it's",
    "itself", "i've", "j", "just", "k", "keep", "keeps", "kept", "know", "known", "knows", "l", "last",
    "lately", "later", "latter", "latterly", "least", "less", "lest", "let", "let's", "like", "liked", "likely",
    "likewise", "little", "look", "looking", "looks", "low", "lower", "ltd", "m", "made", "mainly", "make",
    "makes", "many", "may", "maybe", "mayn't", "me", "mean", "meantime", "meanwhile", "merely", "might",
    "mightn't", "mine", "minus", "miss", "more", "moreover", "most", "mostly", "mr", "mrs", "much", "must",
    "mustn't", "my", "myself", "n", "name", "namely", "nd", "near", "nearly", "necessary", "need", "needn't",
    "needs", "neither", "never", "neverf", "neverless", "nevertheless", "new", "next", "nine", "ninety", "no",
    "nobody", "non", "none", "nonetheless", "noone", "no-one", "nor", "normally", "not", "nothing", "notwithstanding",
    "novel", "now", "nowhere", "o", "obviously", "of", "off", "often", "oh", "ok", "okay", "old", "on", "once",
    "one", "ones", "one's", "only", "onto", "opposite", "or", "other", "others", "otherwise", "ought", "oughtn't",
    "our", "ours", "ourselves", "out", "outside", "over", "overall", "own", "p", "particular", "particularly",
    "past", "per", "perhaps", "placed", "please", "plus", "possible", "presumably", "probably", "provided",
    "provides", "q", "que", "quite", "qv", "r", "rather", "rd", "re", "really", "reasonably", "recent", "recently",
    "regarding", "regardless", "regards", "relatively", "respectively", "right", "round", "s", "said", "same", "saw",
    "say", "saying", "says", "second", "secondly", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen",
    "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "shall", "shan't", "she",
    "she'd", "she'll", "she's", "should", "shouldn't", "since", "six", "so", "some", "somebody", "someday",
    "somehow", "someone", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "specified",
    "specify", "specifying", "still", "sub", "such", "sup", "sure", "t", "take", "taken", "taking", "tell", "tends",
    "th", "than", "thank", "thanks", "thanx", "that", "that'll", "thats", "that's", "that've", "the", "The", "their", "theirs",
    "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "there'd", "therefore", "therein", "there'll",
    "there're", "theres", "there's", "thereupon", "there've", "these", "they", "they'd", "they'll", "they're", "they've",
    "thing", "things", "think", "third", "thirty", "this", "thorough", "thoroughly", "those", "though", "three", "through",
    "throughout", "thru", "thus", "till", "to", "together", "too", "took", "toward", "towards", "tried", "tries", "truly",
    "try", "trying", "t's", "twice", "two", "u", "un", "under", "underneath", "undoing", "unfortunately", "unless", "unlike",
    "unlikely", "until", "unto", "up", "upon", "upwards", "us", "use", "used", "useful", "uses", "using", "usually",
    "v", "value", "various", "versus", "very", "via", "viz", "vs", "w", "want", "wants", "was",
    "wasn't", "way", "we", "we'd", "welcome", "well", "we'll", "went", "were", "we're", "weren't",
    "we've", "what", "whatever", "what'll", "what's", "what've", "when", "whence", "whenever", "where",
    "whereafter", "whereas", "whereby", "wherein", "where's", "whereupon", "wherever", "whether", "which",
    "whichever", "while", "whilst", "whither", "who", "who'd", "whoever", "whole", "who'll", "whom",
    "whomever", "who's", "whose", "why", "will", "willing", "wish", "with", "within", "without", "wonder",
    "won't", "would", "wouldn't", "x", "y", "yes", "yet", "you", "you'd", "you'll", "your", "you're",
    "yours", "yourself", "yourselves", "you've", "z", "zero"};

// to check a directory for files
int getdir(string dir, vector<string> &files)
{
    DIR *dp;
    struct dirent *dirp;
    if ((dp = opendir(dir.c_str())) == NULL) {
        cout << "Error(" << errno << ") opening " << dir << endl;
        return errno;
    }
    while ((dirp = readdir(dp)) != NULL) {
        files.push_back(string(dirp->d_name));
    }
    closedir(dp);
    return 0;
}

int main()
{
    string dir = string("H:\\reuters_1000\\reuters_1000\\");
    vector<string> files = vector<string>();
    vector<string> tokens = vector<string>();
    vector<string>::iterator it;
    char str[1024];

    getdir(dir, files);

    for (unsigned int i = 0; i < files.size(); i++) {
        int p = 0;
        int pos = 0;  // word position within the file
        int temp = 0; // position of the previous accepted keyword

        string dir1 = string("H:\\reuters_1000\\modified\\");
        string dir = string("H:\\reuters_1000\\final\\");

        string a = files[i];
        string b = files[i];
        dir1.append(b);
        dir.append(a);

        ofstream op2;
        ifstream f;
        f.open(dir.c_str());

        // reading the file and inserting words into the token vector
        while (f >> str) {
            tokens.insert(tokens.end(), str);
        }
        f.close();

        for (it = tokens.begin(); it < tokens.end(); it++) {
            p++;
            pos++;
            string word = *it;

            // skip tokens that begin with a digit
            if (word[0] >= '0' && word[0] <= '9') { continue; }

            // removing stopwords from input
            int no_stopwords = sizeof(stopwords) / sizeof(stopwords[0]);
            for (int q = 0; q < no_stopwords; q++) {
                if (stopwords[q] == *it) {
                    goto lab; // stopword: skip this token
                }
            }

            // removing special characters
            for (size_t k = 0; k < word.length(); k++) {
                if (!(((word[k] >= 'a') && (word[k] <= 'z')) || ((word[k] >= 'A') && (word[k] <= 'Z')))) { goto lab; }
            }

            // based on position, pairing 4 keywords and storing in the output file
            op2.open("H:\\reuters_1000\\op1.txt", ios::app);
            // if (pos - temp <= 30) {
            w++;
            op2 << *it << "\t";
            if (w % 4 == 0) { // after every fourth keyword, record the source file
                op2 << files[i];
                op2 << "\n";
            }
            // }
            temp = pos;
        lab:
            op2.close();
        }

        tokens.erase(tokens.begin(), tokens.end());
        f.close();
    }

    cin.get();
    return 0;
}
