org.egothor.duplicity.file
Class JaccardCoeficientsFile

java.lang.Object
  extended by org.egothor.duplicity.file.DuplicityCheckingFile
      extended by org.egothor.duplicity.file.JaccardCoeficientsFile

public class JaccardCoeficientsFile
extends DuplicityCheckingFile

Represents a file of Jaccard coefficients; more precisely, an aggregation of "similar unit pairs" files of type AllSimilarUnitPairsFile. Like those files, it contains only pairs where a < b. The file stores instances of the JaccardCoeficient class, that is, triples {first, second, num}, where first and second are identifiers of the units checked for duplicity (a document, paragraph, or sentence) and num is the number of occurrences of the pair in the underlying AllSimilarUnitPairsFile, which equals the number of permutations under which the two units are similar. The file is sorted primarily by the first field and, in case of a tie, by the second field.
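
As an illustration of the triple layout described above, the following is a minimal sketch of how a sorted run of pairs could be collapsed into {first, second, num} triples. It is not the library's implementation: the PairAggregationSketch and Triple names are hypothetical, and unit identifiers are assumed to be plain ints.

```java
import java.util.ArrayList;
import java.util.List;

final class PairAggregationSketch {

    // Hypothetical stand-in for JaccardCoeficient: a {first, second, num} triple.
    static final class Triple {
        final int first;
        final int second;
        final int num;

        Triple(int first, int second, int num) {
            this.first = first;
            this.second = second;
            this.num = num;
        }

        @Override
        public String toString() {
            return "{" + first + ", " + second + ", " + num + "}";
        }
    }

    // Collapses pairs (a, b) with a < b, already sorted by a and then b,
    // into triples that count how often each pair occurs.
    // The input order is preserved, so the result stays sorted.
    static List<Triple> aggregate(int[][] sortedPairs) {
        List<Triple> out = new ArrayList<>();
        int i = 0;
        while (i < sortedPairs.length) {
            int a = sortedPairs[i][0];
            int b = sortedPairs[i][1];
            int count = 0;
            // Consume the whole run of identical pairs.
            while (i < sortedPairs.length
                    && sortedPairs[i][0] == a
                    && sortedPairs[i][1] == b) {
                count++;
                i++;
            }
            out.add(new Triple(a, b, count));
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] pairs = { {1, 2}, {1, 2}, {1, 5}, {3, 4}, {3, 4}, {3, 4} };
        // prints [{1, 2, 2}, {1, 5, 1}, {3, 4, 3}]
        System.out.println(aggregate(pairs));
    }
}
```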

The file should be used as follows.

  1. First, create it by merging CommonSimilarUnitPairsFile instances using AllSimilarUnitPairsFile.mergeToJaccardCoeficientsFile(java.util.ArrayList), or read it from the filesystem using the JaccardCoeficientsFile(String location) constructor.
  2. It can then be merged with other files of Jaccard coefficients by calling the merge(org.egothor.duplicity.file.JaccardCoeficientsFile, org.egothor.duplicity.file.JaccardCoeficientsFile) method.

Author:
Kateřina Dufková

Nested Class Summary
 
Nested classes/interfaces inherited from class org.egothor.duplicity.file.DuplicityCheckingFile
DuplicityCheckingFile.TempFile
 
Field Summary
protected  java.util.ArrayList<java.lang.Integer> endedStreams
          Helper value for the merge() method.
 
Fields inherited from class org.egothor.duplicity.file.DuplicityCheckingFile
location, out
 
Constructor Summary
JaccardCoeficientsFile(java.lang.String location)
           
 
Method Summary
protected  void createOut()
          Creates a permanent file and sets the out field.
 java.lang.String dump()
          Dumps the file with its content to String.
 java.util.Map<TextUnitID,JaccardCoeficient> filterRelevantForDocument(DocumentUnitID doc)
          Filters the file, keeping only the entries relevant for the given document.
 java.lang.String getFilename()
          Returns the filename corresponding to this file.
 boolean hasTheSameContent(DuplicityCheckingFile file)
          Checks whether two files have the same content.
 java.util.Map<DocumentUnitID,java.lang.Double> markDuplicates(java.util.List<DocumentData> docs)
           
 void merge(JaccardCoeficientsFile jcf1, JaccardCoeficientsFile jcf2)
          Merges files externally on the filesystem.
protected  void mergeAll(DataOutputStream dos, java.util.ArrayList<DataInputStream> diss, java.util.ArrayList<JaccardCoeficient> jcs)
          Merges all given input streams into the given output stream.
protected  void mergeAll(JaccardCoeficientsFile mergeTo, java.util.ArrayList<JaccardCoeficientsFile> jcfs)
          Merges multiple files externally, on the filesystem.
 void remove(java.util.Set<DocumentUnitID> toRemove)
          Removes all occurrences of the documents in the given set from the file.
 
Methods inherited from class org.egothor.duplicity.file.DuplicityCheckingFile
createPermOut, createTempOut, delete, dump, getLocation, getNewTempFile, getOut, hasTheSameContent, initFromProducer, openOut, remove, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

endedStreams

protected java.util.ArrayList<java.lang.Integer> endedStreams
Helper value for the merge() method. Holds the ordinal numbers of the streams that have ended.

Constructor Detail

JaccardCoeficientsFile

public JaccardCoeficientsFile(java.lang.String location)
                       throws java.io.IOException,
                              DuplicityCheckingException
Throws:
java.io.IOException
DuplicityCheckingException
Method Detail

merge

public void merge(JaccardCoeficientsFile jcf1,
                  JaccardCoeficientsFile jcf2)
           throws java.io.IOException
Merges files externally on the filesystem. This is only a convenience wrapper for mergeAll(org.egothor.duplicity.file.JaccardCoeficientsFile, java.util.ArrayList).

Parameters:
jcf1 - a file to be merged into this
jcf2 - a file to be merged into this
Throws:
java.io.IOException - if a temporary file could not be created
See Also:
mergeAll(org.egothor.duplicity.file.JaccardCoeficientsFile, java.util.ArrayList)

mergeAll

protected void mergeAll(JaccardCoeficientsFile mergeTo,
                        java.util.ArrayList<JaccardCoeficientsFile> jcfs)
                 throws java.io.IOException
Merges multiple files externally, on the filesystem. The output file may safely be among the input files, because the method writes to a temporary file and renames it to the output file only after merging has finished. Warning: the existing content of the output file mergeTo is discarded.

Parameters:
mergeTo - file where the result will be placed
jcfs - list of files to be merged; each can be a temporary or permanent JaccardCoeficientsFile.
Throws:
java.io.IOException - if a temporary file could not be created
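
The temporary-file-then-rename approach described for this method can be sketched as follows. This is only an illustration under stated assumptions, not the library's code: TempThenRenameSketch and mergeInto are hypothetical names, the files are assumed to be small text files with one entry per line, and the merge itself is simplified to an in-memory sort rather than a true external merge.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class TempThenRenameSketch {

    // Merges the lines of all inputs into mergeTo. The result is first written
    // to a temporary file in the same directory and only then moved over
    // mergeTo, so mergeTo may itself be one of the inputs.
    static void mergeInto(Path mergeTo, List<Path> inputs) throws IOException {
        List<String> merged = new ArrayList<>();
        for (Path in : inputs) {
            merged.addAll(Files.readAllLines(in, StandardCharsets.UTF_8));
        }
        // Simplification: sort in memory instead of merging externally.
        Collections.sort(merged);

        Path dir = mergeTo.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "merge", ".tmp");
        Files.write(tmp, merged, StandardCharsets.UTF_8);
        // The previous content of mergeTo is discarded at this point.
        Files.move(tmp, mergeTo, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sketch");
        Path a = dir.resolve("a.txt");
        Path b = dir.resolve("b.txt");
        Files.write(a, Arrays.asList("1 2 3", "3 4 1"), StandardCharsets.UTF_8);
        Files.write(b, Arrays.asList("1 5 2"), StandardCharsets.UTF_8);
        mergeInto(a, Arrays.asList(a, b)); // the output is also an input
        System.out.println(Files.readAllLines(a, StandardCharsets.UTF_8));
    }
}
```

Writing to a temporary file in the same directory as the target keeps the final move on one filesystem, which is the usual reason for this design.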

mergeAll

protected void mergeAll(DataOutputStream dos,
                        java.util.ArrayList<DataInputStream> diss,
                        java.util.ArrayList<JaccardCoeficient> jcs)
                 throws java.io.IOException
Merges all given input streams into the given output stream. Expects the leading element (the first element not yet written to the output) of each input stream in the jcs parameter.

Parameters:
dos - DataOutputStream to which to write output
diss - list of DataInputStreams to be merged
jcs - leading elements of streams
Throws:
java.io.IOException - on an error reading from or writing to a stream
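
A merge driven by the leading element of each stream might look like the sketch below. It makes several assumptions that the documentation does not confirm: entries are modelled as int arrays {first, second, num} rather than JaccardCoeficient objects, iterators stand in for DataInputStreams, each input holds every (first, second) key at most once, and the num values of equal keys are summed. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class LeadingElementMergeSketch {

    // Orders entries by first, then by second.
    static int compareKeys(int[] x, int[] y) {
        int c = Integer.compare(x[0], y[0]);
        return c != 0 ? c : Integer.compare(x[1], y[1]);
    }

    static List<int[]> mergeAll(List<Iterator<int[]>> streams) {
        // Leading element of every stream; null marks a stream that has ended.
        List<int[]> leading = new ArrayList<>();
        for (Iterator<int[]> s : streams) {
            leading.add(s.hasNext() ? s.next() : null);
        }
        List<int[]> out = new ArrayList<>();
        while (true) {
            // Find the smallest key among the leading elements.
            int[] min = null;
            for (int[] e : leading) {
                if (e != null && (min == null || compareKeys(e, min) < 0)) {
                    min = e;
                }
            }
            if (min == null) {
                break; // all streams have ended
            }
            int first = min[0];
            int second = min[1];
            int sum = 0;
            // Assumption: counts of equal keys are summed. Every stream that
            // contributed is advanced to its next leading element.
            for (int i = 0; i < leading.size(); i++) {
                int[] e = leading.get(i);
                if (e != null && e[0] == first && e[1] == second) {
                    sum += e[2];
                    Iterator<int[]> s = streams.get(i);
                    leading.set(i, s.hasNext() ? s.next() : null);
                }
            }
            out.add(new int[] {first, second, sum});
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<int[]>> streams = new ArrayList<>();
        streams.add(Arrays.asList(new int[] {1, 2, 3}, new int[] {2, 3, 1}).iterator());
        streams.add(Arrays.asList(new int[] {1, 2, 4}, new int[] {1, 5, 2}).iterator());
        for (int[] e : mergeAll(streams)) {
            System.out.println(Arrays.toString(e));
        }
    }
}
```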

markDuplicates

public java.util.Map<DocumentUnitID,java.lang.Double> markDuplicates(java.util.List<DocumentData> docs)
                                                              throws java.io.IOException,
                                                                     DuplicityCheckingException
Throws:
java.io.IOException
DuplicityCheckingException

getFilename

public java.lang.String getFilename()
Returns the filename corresponding to this file. The location field MUST already be set. The filename is built from the directory given in the location field and the name Constants.JACCARD_COEFICIENTS_FILE_NAME.

Specified by:
getFilename in class DuplicityCheckingFile

createOut

protected void createOut()
                  throws java.io.IOException
Creates a permanent file and sets the out field. Uses the DuplicityCheckingFile.createPermOut() method.

Specified by:
createOut in class DuplicityCheckingFile
Throws:
java.io.IOException - if the file already exists or could not be created
See Also:
DuplicityCheckingFile.createPermOut()

dump

public java.lang.String dump()
Dumps the file with its content to String.

Returns:
String representation of the file with its content

remove

public void remove(java.util.Set<DocumentUnitID> toRemove)
            throws java.io.IOException
Removes all occurrences of the documents in the given set from the file.

Parameters:
toRemove - set of document ids to remove
Throws:
java.io.IOException

filterRelevantForDocument

public java.util.Map<TextUnitID,JaccardCoeficient> filterRelevantForDocument(DocumentUnitID doc)
                                                                      throws java.io.IOException
Filters the file, keeping only the entries relevant for the given document. An entry is considered relevant if it concerns the document; for each text unit, only the entry with the highest Jaccard coefficient is kept. Only entries whose Jaccard coefficient is above the threshold value are returned (see Constants.SIMILARITY_RELEVANT_TRESHOLD).

Parameters:
doc - document for which the relevant entries are requested
Returns:
map containing the relevant entries
Throws:
java.io.IOException - on an error while reading the file from the filesystem
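
The selection rule described for this method (best entry per text unit, above a threshold) can be sketched roughly as below. Everything here is an assumption made for illustration: unit identifiers are modelled as strings of the form "document:unit", each entry carries a ready-made score, and the THRESHOLD value is invented; the real method works with TextUnitID, JaccardCoeficient and Constants.SIMILARITY_RELEVANT_TRESHOLD.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RelevantFilterSketch {

    // Hypothetical entry: two unit ids ("document:unit") and a similarity score.
    static final class Entry {
        final String first;
        final String second;
        final double score;

        Entry(String first, String second, double score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }
    }

    // Hypothetical threshold; the real value lives in the library's Constants.
    static final double THRESHOLD = 0.5;

    // For every text unit of doc, keeps the entry with the highest score,
    // considering only entries whose score is strictly above THRESHOLD.
    static Map<String, Entry> filter(List<Entry> entries, String doc) {
        String prefix = doc + ":";
        Map<String, Entry> best = new HashMap<>();
        for (Entry e : entries) {
            if (e.score <= THRESHOLD) {
                continue; // below or at the threshold: not relevant
            }
            for (String unit : new String[] {e.first, e.second}) {
                if (!unit.startsWith(prefix)) {
                    continue; // this side of the pair belongs to another document
                }
                Entry current = best.get(unit);
                if (current == null || e.score > current.score) {
                    best.put(unit, e);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("d1:0", "d2:3", 0.9));
        entries.add(new Entry("d1:0", "d3:1", 0.7));
        entries.add(new Entry("d1:1", "d2:0", 0.4));
        entries.add(new Entry("d2:0", "d3:0", 0.95));
        Map<String, Entry> relevant = filter(entries, "d1");
        // Only d1:0 survives, paired with d2:3.
        System.out.println(relevant.keySet() + " -> " + relevant.get("d1:0").second);
    }
}
```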

hasTheSameContent

public boolean hasTheSameContent(DuplicityCheckingFile file)
Checks whether two files have the same content.

Parameters:
file - the second file to be tested
Returns:
true, if the file contents are the same