java.lang.Object
  org.egothor.duplicity.file.DuplicityCheckingFile
      org.egothor.duplicity.file.JaccardCoeficientsFile
public class JaccardCoeficientsFile
extends DuplicityCheckingFile
Represents the file of Jaccard coefficients or, more precisely, an aggregation of "similar unit pairs" files of type AllSimilarUnitPairsFile.
Like the "similar unit pairs" files, this file contains only pairs where a < b.
The file contains instances of the JaccardCoeficient class.
That is, it contains triples {first, second, num}, where first and second are identifiers of the units on which duplicity is checked (a unit can be a document, a paragraph or a sentence) and num is the number of occurrences of the pair in the underlying AllSimilarUnitPairsFile, which is the number of permutations according to which the two units are similar.
The file is sorted: the primary criterion is the first field and, in case of a tie, the second field.
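The triple layout and sort order described above can be sketched as follows. This is a minimal illustration, not the library's code: the Triple class is a hypothetical stand-in for JaccardCoeficient, identifiers are assumed to be plain ints, and dividing num by the total number of permutations to obtain a similarity estimate is an assumption in the spirit of min-wise hashing rather than something this page states.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TripleOrderSketch {

    /** Hypothetical stand-in for JaccardCoeficient: a {first, second, num} triple. */
    static final class Triple {
        final int first;
        final int second;
        final int num;

        Triple(int first, int second, int num) {
            if (first >= second) {
                throw new IllegalArgumentException("only pairs with first < second are stored");
            }
            this.first = first;
            this.second = second;
            this.num = num;
        }
    }

    /** File order: by the first field, ties broken by the second field. */
    static final Comparator<Triple> FILE_ORDER =
            Comparator.<Triple>comparingInt(t -> t.first).thenComparingInt(t -> t.second);

    /** Assumed interpretation: share of permutations under which the pair was similar. */
    static double estimate(Triple t, int permutations) {
        return (double) t.num / permutations;
    }

    public static void main(String[] args) {
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple(2, 7, 40));
        triples.add(new Triple(1, 9, 10));
        triples.add(new Triple(1, 3, 25));
        triples.sort(FILE_ORDER);
        for (Triple t : triples) {
            System.out.println(t.first + " " + t.second + " " + estimate(t, 100));
        }
    }
}
```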
The file should be used as follows.
An instance is either created from CommonSimilarUnitPairsFile files using AllSimilarUnitPairsFile.mergeToJaccardCoeficientsFile(java.util.ArrayList), or read from the filesystem using the constructor JaccardCoeficientsFile(String location).
Further files are then merged into it using the merge(org.egothor.duplicity.file.JaccardCoeficientsFile, org.egothor.duplicity.file.JaccardCoeficientsFile) method.
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.egothor.duplicity.file.DuplicityCheckingFile |
---|
DuplicityCheckingFile.TempFile |
Field Summary | |
---|---|
protected java.util.ArrayList<java.lang.Integer> | endedStreams - Helper value for the merge() method. |
Fields inherited from class org.egothor.duplicity.file.DuplicityCheckingFile |
---|
location, out |
Constructor Summary | |
---|---|
JaccardCoeficientsFile(java.lang.String location) | |
Method Summary | |
---|---|
protected void | createOut() - Creates the permanent file and sets the out field. |
java.lang.String | dump() - Dumps the file with its content to a String. |
java.util.Map<TextUnitID,JaccardCoeficient> | filterRelevantForDocument(DocumentUnitID doc) - Filters from the file only the entries relevant for the given document. |
java.lang.String | getFilename() - Returns the filename corresponding to this file. |
boolean | hasTheSameContent(DuplicityCheckingFile file) - Checks if two files have the same content. |
java.util.Map<DocumentUnitID,java.lang.Double> | markDuplicates(java.util.List<DocumentData> docs) |
void | merge(JaccardCoeficientsFile jcf1, JaccardCoeficientsFile jcf2) - Merges files externally on the filesystem. |
protected void | mergeAll(DataOutputStream dos, java.util.ArrayList<DataInputStream> diss, java.util.ArrayList<JaccardCoeficient> jcs) - Merges all given input streams to the given output stream. |
protected void | mergeAll(JaccardCoeficientsFile mergeTo, java.util.ArrayList<JaccardCoeficientsFile> jcfs) - Merges multiple files externally, on the filesystem. |
void | remove(java.util.Set<DocumentUnitID> toRemove) - Removes all occurrences of documents given in the set from the file. |
Methods inherited from class org.egothor.duplicity.file.DuplicityCheckingFile |
---|
createPermOut, createTempOut, delete, dump, getLocation, getNewTempFile, getOut, hasTheSameContent, initFromProducer, openOut, remove, toString |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
protected java.util.ArrayList<java.lang.Integer> endedStreams
Constructor Detail |
---|
public JaccardCoeficientsFile(java.lang.String location) throws java.io.IOException, DuplicityCheckingException
Throws:
java.io.IOException
DuplicityCheckingException
Method Detail |
---|
public void merge(JaccardCoeficientsFile jcf1, JaccardCoeficientsFile jcf2) throws java.io.IOException
Merges files externally on the filesystem. See mergeAll(org.egothor.duplicity.file.JaccardCoeficientsFile, java.util.ArrayList).
Parameters:
jcf1 - a file to be merged into this file
jcf2 - a file to be merged into this file
Throws:
java.io.IOException - if a temporary file could not be created
See Also:
mergeAll(org.egothor.duplicity.file.JaccardCoeficientsFile, java.util.ArrayList)
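The merge operations of this class combine files that are already sorted by first and then second. A minimal sketch of such a merge over data streams is shown here, keeping one leading element per stream. Several details are assumptions made only for illustration: each record is taken to be three consecutive ints, every input is taken to contain each pair at most once, and equal pairs from different inputs are combined by summing num. The class and method names are hypothetical and the actual on-disk format may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class StreamMergeSketch {

    /** Reads the next {first, second, num} triple, or returns null at end of stream. */
    static int[] readTriple(DataInputStream in) throws IOException {
        try {
            int first = in.readInt();
            int second = in.readInt();
            int num = in.readInt();
            return new int[] {first, second, num};
        } catch (EOFException e) {
            return null;
        }
    }

    /** Merges sorted streams; equal pairs are combined by summing num (an assumption). */
    static void mergeAll(DataOutputStream out, List<DataInputStream> ins) throws IOException {
        // heads.get(k) is the leading element of stream k, or null once that stream has ended.
        List<int[]> heads = new ArrayList<>();
        for (DataInputStream in : ins) {
            heads.add(readTriple(in));
        }
        while (true) {
            int[] min = null;
            for (int[] h : heads) {
                if (h != null && (min == null || h[0] < min[0]
                        || (h[0] == min[0] && h[1] < min[1]))) {
                    min = h;
                }
            }
            if (min == null) {
                break; // every stream has ended
            }
            int first = min[0];
            int second = min[1];
            int num = 0;
            for (int k = 0; k < heads.size(); k++) {
                int[] h = heads.get(k);
                if (h != null && h[0] == first && h[1] == second) {
                    num += h[2];
                    heads.set(k, readTriple(ins.get(k))); // advance stream k
                }
            }
            out.writeInt(first);
            out.writeInt(second);
            out.writeInt(num);
        }
        out.flush();
    }

    /** Convenience wrapper for trying the merge on in-memory data. */
    static List<int[]> mergeInMemory(int[][]... inputs) {
        try {
            List<DataInputStream> ins = new ArrayList<>();
            for (int[][] input : inputs) {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                DataOutputStream enc = new DataOutputStream(bytes);
                for (int[] t : input) {
                    enc.writeInt(t[0]);
                    enc.writeInt(t[1]);
                    enc.writeInt(t[2]);
                }
                ins.add(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            }
            ByteArrayOutputStream merged = new ByteArrayOutputStream();
            mergeAll(new DataOutputStream(merged), ins);
            DataInputStream result =
                    new DataInputStream(new ByteArrayInputStream(merged.toByteArray()));
            List<int[]> triples = new ArrayList<>();
            for (int[] t = readTriple(result); t != null; t = readTriple(result)) {
                triples.add(t);
            }
            return triples;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<int[]> merged = mergeInMemory(
                new int[][] {{1, 3, 2}, {2, 5, 1}},
                new int[][] {{1, 3, 4}, {1, 8, 7}});
        for (int[] t : merged) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

Scanning all leading elements for the minimum is linear in the number of streams; for many inputs a priority queue would be the usual refinement.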
protected void mergeAll(JaccardCoeficientsFile mergeTo, java.util.ArrayList<JaccardCoeficientsFile> jcfs) throws java.io.IOException
Merges multiple files externally, on the filesystem.
Parameters:
mergeTo - the file where the result will be placed
jcfs - list of files to be merged; each can be a temporary or a permanent JaccardCoeficientsFile
Throws:
java.io.IOException - if a temporary file could not be created
protected void mergeAll(DataOutputStream dos, java.util.ArrayList<DataInputStream> diss, java.util.ArrayList<JaccardCoeficient> jcs) throws java.io.IOException
Merges all given input streams to the given output stream.
Parameters:
dos - DataOutputStream to which the output is written
diss - list of DataInputStreams to be merged
jcs - leading elements of the streams
Throws:
java.io.IOException - on error reading from or writing to a stream
public java.util.Map<DocumentUnitID,java.lang.Double> markDuplicates(java.util.List<DocumentData> docs) throws java.io.IOException, DuplicityCheckingException
Throws:
java.io.IOException
DuplicityCheckingException
public java.lang.String getFilename()
Returns the filename corresponding to this file; see Constants.JACCARD_COEFICIENTS_FILE_NAME.
getFilename in class DuplicityCheckingFile
protected void createOut() throws java.io.IOException
Creates the permanent file and sets the out field. See the DuplicityCheckingFile.createPermOut() method.
createOut in class DuplicityCheckingFile
Throws:
java.io.IOException - if the file already exists or could not be created
See Also:
DuplicityCheckingFile.createPermOut()
public java.lang.String dump()
public void remove(java.util.Set<DocumentUnitID> toRemove) throws java.io.IOException
Removes all occurrences of documents given in the set from the file.
Parameters:
toRemove - set of document ids to remove
Throws:
java.io.IOException
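The effect of remove can be pictured as dropping every triple in which either unit belongs to a removed document. The sketch below works on an in-memory list and is only an illustration under assumptions: triples are int arrays {first, second, num}, and the docOf mapping from a unit identifier to its document identifier is hypothetical. The real method operates on the file and may be implemented differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntUnaryOperator;

public class RemoveSketch {

    /**
     * Returns the triples that do not involve any removed document.
     * docOf maps a unit identifier to the identifier of its document (hypothetical).
     */
    static List<int[]> remove(List<int[]> triples, Set<Integer> toRemove, IntUnaryOperator docOf) {
        List<int[]> kept = new ArrayList<>();
        for (int[] t : triples) {
            boolean firstRemoved = toRemove.contains(docOf.applyAsInt(t[0]));
            boolean secondRemoved = toRemove.contains(docOf.applyAsInt(t[1]));
            if (!firstRemoved && !secondRemoved) {
                kept.add(t); // input order is preserved, so the result stays sorted
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> triples = Arrays.asList(
                new int[] {1, 3, 6}, new int[] {1, 8, 7}, new int[] {2, 5, 1});
        Set<Integer> toRemove = new HashSet<>(Arrays.asList(3, 5));
        // Units are whole documents here, so a unit's document is the unit itself.
        List<int[]> kept = remove(triples, toRemove, unit -> unit);
        for (int[] t : kept) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```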
public java.util.Map<TextUnitID,JaccardCoeficient> filterRelevantForDocument(DocumentUnitID doc) throws java.io.IOException
Filters from the file only the entries relevant for the given document (see Constants.SIMILARITY_RELEVANT_TRESHOLD).
Parameters:
doc - document for which the relevant entries are requested
Throws:
java.io.IOException - on error while reading the file from the filesystem
public boolean hasTheSameContent(DuplicityCheckingFile file)
Checks if two files have the same content.
Parameters:
file - the second file to be tested