[Git][java-team/libexternalsortinginjava-java][upstream] New upstream version 0.6.1
Andreas Tille (@tille)
gitlab at salsa.debian.org
Thu Jan 27 20:14:54 GMT 2022
Andreas Tille pushed to branch upstream at Debian Java Maintainers / libexternalsortinginjava-java
Commits:
5ef8e291 by Andreas Tille at 2022-01-27T20:56:45+01:00
New upstream version 0.6.1
- - - - -
18 changed files:
- + .github/workflows/java8.yml
- .gitignore
- .travis.yml
- README.md
- pom.xml
- + src/main/java/com/google/code/externalsorting/BinaryFileBuffer.java
- src/main/java/com/google/code/externalsorting/ExternalSort.java
- + src/main/java/com/google/code/externalsorting/IOStringStack.java
- + src/main/java/com/google/code/externalsorting/csv/CSVRecordBuffer.java
- + src/main/java/com/google/code/externalsorting/csv/CsvExternalSort.java
- + src/main/java/com/google/code/externalsorting/csv/CsvSortOptions.java
- + src/main/java/com/google/code/externalsorting/csv/SizeEstimator.java
- src/test/java/com/google/code/externalsorting/ExternalSortTest.java
- + src/test/java/com/google/code/externalsorting/csv/CsvExternalSortTest.java
- + src/test/resources/externalSorting.csv
- + src/test/resources/externalSortingSemicolon.csv
- + src/test/resources/externalSortingTabs.csv
- + src/test/resources/nonLatinSorting.csv
Changes:
=====================================
.github/workflows/java8.yml
=====================================
@@ -0,0 +1,16 @@
+name: Java CI
+
+on: [push]
+
+jobs:
+ build:
+ runs-on: ubuntu-latest
+
+ steps:
+ - uses: actions/checkout@v2
+ - name: Set up JDK 1.8
+ uses: actions/setup-java@v1
+ with:
+ java-version: 1.8
+ - name: Build with Maven
+ run: mvn -B package
=====================================
.gitignore
=====================================
@@ -1 +1,3 @@
-/target/
\ No newline at end of file
+/target/
+*.iml
+.idea
\ No newline at end of file
=====================================
.travis.yml
=====================================
@@ -1,8 +1,8 @@
language: java
jdk:
- - oraclejdk9
- - oraclejdk8
+ - openjdk8
+ - openjdk12
install: true
=====================================
README.md
=====================================
@@ -4,15 +4,15 @@ Externalsortinginjava
[![][maven img]][maven]
[![][license img]][license]
[![docs-badge][]][docs]
-[![Coverage Status](https://coveralls.io/repos/github/lemire/externalsortinginjava/badge.svg?branch=master)](https://coveralls.io/github/lemire/externalsortinginjava?branch=master)
+![Java CI](https://github.com/lemire/externalsortinginjava/workflows/Java%20CI/badge.svg)
External-Memory Sorting in Java: useful to sort very large files using multiple cores and an external-memory algorithm.
The versions 0.1 of the library are compatible with Java 6 and above. Versions 0.2 and above
-require at least Java 8.
+require at least Java 8.
-This code is used in [Apache Jackrabbit Oak](https://github.com/apache/jackrabbit-oak).
+This code is used in [Apache Jackrabbit Oak](https://github.com/apache/jackrabbit-oak) as well as in [Apache Beam](https://github.com/apache/beam) and in [Spotify scio](https://github.com/spotify/scio).
Code sample
------------
@@ -27,6 +27,41 @@ ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(new File(inputfile)), new
// you can also provide a custom string comparator, see API
```
+
+Code sample (CSV)
+------------
+
+For sorting CSV files, it might be more convenient to use `CsvExternalSort`.
+
+```java
+import com.google.code.externalsorting.CsvExternalSort;
+import com.google.code.externalsorting.CsvSortOptions;
+
+// provide a comparator
+Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
+//... inputfile: input file name
+//... outputfile: output file name
+//...provide sort options
+CsvSortOptions sortOptions = new CsvSortOptions
+ .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
+ .charset(Charset.defaultCharset())
+ .distinct(false)
+ .numHeader(1)
+ .skipHeader(false)
+ .format(CSVFormat.DEFAULT)
+ .build();
+// container to store the header lines
+ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();
+
+// next two lines sort the lines from inputfile to outputfile
+List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
+// at this point you can access header if you'd like.
+CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);
+
+```
+
+The `numHeader` parameter is the number of lines of headers in the CSV files (typically 1 or 0) and the `skipHeader` parameter indicates whether you would like to exclude these lines from the parsing.
+
API Documentation
-----------------
@@ -40,7 +75,7 @@ Maven dependency
You can download the jar files from the Maven central repository:
-http://repo1.maven.org/maven2/com/google/code/externalsortinginjava/externalsortinginjava/
+https://repo1.maven.org/maven2/com/google/code/externalsortinginjava/externalsortinginjava/
You can also specify the dependency in the Maven "pom.xml" file:
@@ -49,7 +84,7 @@ You can also specify the dependency in the Maven "pom.xml" file:
<dependency>
<groupId>com.google.code.externalsortinginjava</groupId>
<artifactId>externalsortinginjava</artifactId>
- <version>[0.1.9,)</version>
+ <version>[0.6.0,)</version>
</dependency>
</dependencies>
```
@@ -60,6 +95,7 @@ How to build
- get the java jdk
- Install Maven 2
- mvn install - builds jar (requires signing)
+- or mvn package - builds jar (does not require signing)
- mvn test - runs tests
=====================================
pom.xml
=====================================
@@ -3,7 +3,7 @@
<groupId>com.google.code.externalsortinginjava</groupId>
<artifactId>externalsortinginjava</artifactId>
<packaging>jar</packaging>
- <version>0.2.5</version>
+ <version>0.6.1</version>
<name>externalsortinginjava</name>
<url>http://github.com/lemire/externalsortinginjava/</url>
<description>Sometimes, you want to sort large file without first loading them into memory. The solution is to use External Sorting. You divide the files into small blocks, sort each block in RAM, and then merge the result.
@@ -42,11 +42,18 @@
<version>4.12</version>
</dependency>
- <dependency>
- <groupId>com.github.jbellis</groupId>
- <artifactId>jamm</artifactId>
- <version>0.3.1</version>
- </dependency>
+ <dependency>
+ <groupId>com.github.jbellis</groupId>
+ <artifactId>jamm</artifactId>
+ <version>0.3.1</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.commons</groupId>
+ <artifactId>commons-csv</artifactId>
+ <version>1.9.0</version>
+ </dependency>
+
</dependencies>
</dependencyManagement>
<dependencies>
@@ -61,6 +68,10 @@
<scope>test</scope>
</dependency>
+ <dependency>
+ <groupId>org.apache.commons</groupId>
+ <artifactId>commons-csv</artifactId>
+ </dependency>
</dependencies>
<build>
<plugins>
@@ -165,11 +176,13 @@
</execution>
</executions>
</plugin>
- <plugin>
+ <plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<version>2.10.4</version>
-
+ <configuration>
+ <source>8</source>
+ </configuration>
<executions>
<execution>
<id>attach-javadocs</id>
@@ -198,6 +211,6 @@
<connection>scm:git:git@github.com:lemire/externalsortinginjava.git</connection>
<url>scm:git:git@github.com:lemire/externalsortinginjava.git</url>
<developerConnection>scm:git:git@github.com:lemire/externalsortinginjava.git</developerConnection>
- <tag>externalsortinginjava-0.2.5</tag>
+ <tag>externalsortinginjava-0.6.1</tag>
</scm>
</project>
=====================================
src/main/java/com/google/code/externalsorting/BinaryFileBuffer.java
=====================================
@@ -0,0 +1,42 @@
+package com.google.code.externalsorting;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+
+/**
+ * This is essentially a thin wrapper on top of a BufferedReader... which keeps
+ * the last line in memory.
+ *
+ */
+public final class BinaryFileBuffer implements IOStringStack {
+ public BinaryFileBuffer(BufferedReader r) throws IOException {
+ this.fbr = r;
+ reload();
+ }
+ public void close() throws IOException {
+ this.fbr.close();
+ }
+
+ public boolean empty() {
+ return this.cache == null;
+ }
+
+ public String peek() {
+ return this.cache;
+ }
+
+ public String pop() throws IOException {
+ String answer = peek().toString();// make a copy
+ reload();
+ return answer;
+ }
+
+ private void reload() throws IOException {
+ this.cache = this.fbr.readLine();
+ }
+
+ private BufferedReader fbr;
+
+ private String cache;
+
+}
\ No newline at end of file
=====================================
src/main/java/com/google/code/externalsorting/ExternalSort.java
=====================================
@@ -212,16 +212,16 @@ public class ExternalSort {
*/
public static long mergeSortedFiles(BufferedWriter fbw,
final Comparator<String> cmp, boolean distinct,
- List<BinaryFileBuffer> buffers) throws IOException {
- PriorityQueue<BinaryFileBuffer> pq = new PriorityQueue<>(
- 11, new Comparator<BinaryFileBuffer>() {
+ List<IOStringStack> buffers) throws IOException {
+ PriorityQueue<IOStringStack> pq = new PriorityQueue<>(
+ 11, new Comparator<IOStringStack>() {
@Override
- public int compare(BinaryFileBuffer i,
- BinaryFileBuffer j) {
+ public int compare(IOStringStack i,
+ IOStringStack j) {
return cmp.compare(i.peek(), j.peek());
}
});
- for (BinaryFileBuffer bfb : buffers) {
+ for (IOStringStack bfb : buffers) {
if (!bfb.empty()) {
pq.add(bfb);
}
@@ -230,13 +230,13 @@ public class ExternalSort {
try {
if (!distinct) {
while (pq.size() > 0) {
- BinaryFileBuffer bfb = pq.poll();
+ IOStringStack bfb = pq.poll();
String r = bfb.pop();
fbw.write(r);
fbw.newLine();
++rowcounter;
if (bfb.empty()) {
- bfb.fbr.close();
+ bfb.close();
} else {
pq.add(bfb); // add it back
}
@@ -244,19 +244,19 @@ public class ExternalSort {
} else {
String lastLine = null;
if(pq.size() > 0) {
- BinaryFileBuffer bfb = pq.poll();
+ IOStringStack bfb = pq.poll();
lastLine = bfb.pop();
fbw.write(lastLine);
fbw.newLine();
++rowcounter;
if (bfb.empty()) {
- bfb.fbr.close();
+ bfb.close();
} else {
pq.add(bfb); // add it back
}
}
while (pq.size() > 0) {
- BinaryFileBuffer bfb = pq.poll();
+ IOStringStack bfb = pq.poll();
String r = bfb.pop();
// Skip duplicate lines
if (cmp.compare(r, lastLine) != 0) {
@@ -266,7 +266,7 @@ public class ExternalSort {
}
++rowcounter;
if (bfb.empty()) {
- bfb.fbr.close();
+ bfb.close();
} else {
pq.add(bfb); // add it back
}
@@ -274,7 +274,7 @@ public class ExternalSort {
}
} finally {
fbw.close();
- for (BinaryFileBuffer bfb : pq) {
+ for (IOStringStack bfb : pq) {
bfb.close();
}
}
@@ -282,6 +282,7 @@ public class ExternalSort {
}
+
/**
* This merges a bunch of temporary flat files
*
@@ -392,7 +393,7 @@ public class ExternalSort {
public static long mergeSortedFiles(List<File> files, File outputfile,
final Comparator<String> cmp, Charset cs, boolean distinct,
boolean append, boolean usegzip) throws IOException {
- ArrayList<BinaryFileBuffer> bfbs = new ArrayList<>();
+ ArrayList<IOStringStack> bfbs = new ArrayList<>();
for (File f : files) {
final int BUFFERSIZE = 2048;
InputStream in = new FileInputStream(f);
@@ -419,6 +420,53 @@ public class ExternalSort {
return rowcounter;
}
+ /**
+ * This merges a bunch of temporary flat files
+ *
+ * @param files The {@link List} of sorted {@link File}s to be merged.
+ * @param distinct Pass <code>true</code> if duplicate lines should be
+ * discarded.
+ * @param fbw The output {@link BufferedWriter} to merge the results to.
+ * @param cmp The {@link Comparator} to use to compare
+ * {@link String}s.
+ * @param cs The {@link Charset} to be used for the byte to
+ * character conversion.
+ * @param usegzip assumes we used gzip compression for temporary files
+ * @return The number of lines sorted.
+ * @throws IOException generic IO exception
+ * @since v0.1.4
+ */
+ public static long mergeSortedFiles(List<File> files, BufferedWriter fbw,
+ final Comparator<String> cmp, Charset cs, boolean distinct,
+ boolean usegzip) throws IOException {
+ ArrayList<IOStringStack> bfbs = new ArrayList<>();
+ for (File f : files) {
+ final int BUFFERSIZE = 2048;
+ if (f.length() == 0) {
+ continue;
+ }
+ InputStream in = new FileInputStream(f);
+ BufferedReader br;
+ if (usegzip) {
+ br = new BufferedReader(
+ new InputStreamReader(
+ new GZIPInputStream(in,
+ BUFFERSIZE), cs));
+ } else {
+ br = new BufferedReader(new InputStreamReader(
+ in, cs));
+ }
+
+ BinaryFileBuffer bfb = new BinaryFileBuffer(br);
+ bfbs.add(bfb);
+ }
+ long rowcounter = mergeSortedFiles(fbw, cmp, distinct, bfbs);
+ for (File f : files) {
+ f.delete();
+ }
+ return rowcounter;
+ }
+
/**
* This sorts a file (input) to an output file (output) using default
* parameters
@@ -447,7 +495,7 @@ public class ExternalSort {
*/
public static void sort(final File input, final File output, final Comparator<String> cmp)
throws IOException {
- ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(input),
+ ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(input, cmp),
output, cmp);
}
@@ -876,41 +924,3 @@ public class ExternalSort {
public static final int DEFAULTMAXTEMPFILES = 1024;
}
-
-/**
- * This is essentially a thin wrapper on top of a BufferedReader... which keeps
- * the last line in memory.
- *
- */
-final class BinaryFileBuffer {
- public BinaryFileBuffer(BufferedReader r) throws IOException {
- this.fbr = r;
- reload();
- }
- public void close() throws IOException {
- this.fbr.close();
- }
-
- public boolean empty() {
- return this.cache == null;
- }
-
- public String peek() {
- return this.cache;
- }
-
- public String pop() throws IOException {
- String answer = peek().toString();// make a copy
- reload();
- return answer;
- }
-
- private void reload() throws IOException {
- this.cache = this.fbr.readLine();
- }
-
- public BufferedReader fbr;
-
- private String cache;
-
-}
=====================================
src/main/java/com/google/code/externalsorting/IOStringStack.java
=====================================
@@ -0,0 +1,18 @@
+package com.google.code.externalsorting;
+
+import java.io.IOException;
+
+/**
+ * General interface to abstract away BinaryFileBuffer
+ * so that users of the library can roll their own.
+ */
+public interface IOStringStack {
+ public void close() throws IOException;
+
+ public boolean empty();
+
+ public String peek();
+
+ public String pop() throws IOException;
+
+}
\ No newline at end of file
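
Note on IOStringStack: as a rough illustration of what "roll their own" could look like, the sketch below backs the stack with an in-memory list instead of a BufferedReader. Assumptions: the interface is re-declared locally so the snippet compiles without the library on the classpath (real code would import com.google.code.externalsorting.IOStringStack instead), and ListStringStack and MergeSketch are hypothetical names that are not part of the library. The merge loop follows the priority-queue approach visible in the ExternalSort.java hunks above, but it is not the library's code.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

// Local copy of the interface from the diff, so the sketch is self-contained.
interface IOStringStack {
    void close() throws IOException;
    boolean empty();
    String peek();
    String pop() throws IOException;
}

// Hypothetical implementation backed by an already-sorted in-memory list.
final class ListStringStack implements IOStringStack {
    private final Deque<String> lines;

    ListStringStack(List<String> sortedLines) {
        this.lines = new ArrayDeque<>(sortedLines);
    }

    @Override public void close() { lines.clear(); }
    @Override public boolean empty() { return lines.isEmpty(); }
    @Override public String peek() { return lines.peekFirst(); }
    @Override public String pop() { return lines.pollFirst(); }
}

final class MergeSketch {
    // k-way merge: repeatedly take the stack whose head line is smallest.
    static List<String> merge(List<IOStringStack> stacks, Comparator<String> cmp)
            throws IOException {
        PriorityQueue<IOStringStack> pq = new PriorityQueue<IOStringStack>(
                11, (a, b) -> cmp.compare(a.peek(), b.peek()));
        for (IOStringStack s : stacks) {
            if (!s.empty()) {
                pq.add(s);
            }
        }
        List<String> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            IOStringStack s = pq.poll();
            out.add(s.pop());
            if (s.empty()) {
                s.close();
            } else {
                pq.add(s); // re-insert so it is ordered by its new head
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        List<IOStringStack> stacks = new ArrayList<>();
        stacks.add(new ListStringStack(Arrays.asList("apple", "cherry")));
        stacks.add(new ListStringStack(Arrays.asList("banana", "date")));
        System.out.println(merge(stacks, Comparator.naturalOrder()));
    }
}
```

Each input list must already be sorted under the same comparator, which is the same precondition the temporary files have to satisfy before they are merged.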
=====================================
src/main/java/com/google/code/externalsorting/csv/CSVRecordBuffer.java
=====================================
@@ -0,0 +1,46 @@
+package com.google.code.externalsorting.csv;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVRecord;
+
+public class CSVRecordBuffer {
+
+ private Iterator<CSVRecord> iterator;
+
+ private CSVParser parser;
+
+ private CSVRecord cache;
+
+ public CSVRecordBuffer(CSVParser parser) throws IOException, ClassNotFoundException {
+ this.iterator = parser.iterator();
+ this.parser = parser;
+ reload();
+ }
+
+ public void close() throws IOException {
+ this.parser.close();
+ }
+
+ public boolean empty() {
+ return this.cache == null;
+ }
+
+ public CSVRecord peek() {
+ return this.cache;
+ }
+
+ //
+ public CSVRecord pop() throws IOException, ClassNotFoundException {
+ CSVRecord answer = peek();// make a copy
+ reload();
+ return answer;
+ }
+
+ // Get the next in line
+ private void reload() throws IOException, ClassNotFoundException {
+ this.cache = this.iterator.hasNext() ? this.iterator.next() : null;
+ }
+}
=====================================
src/main/java/com/google/code/externalsorting/csv/CsvExternalSort.java
=====================================
@@ -0,0 +1,227 @@
+package com.google.code.externalsorting.csv;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.OutputStreamWriter;
+import java.io.Writer;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.PriorityQueue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.logging.Level;
+import java.util.logging.Logger;
+
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVParser;
+import org.apache.commons.csv.CSVPrinter;
+import org.apache.commons.csv.CSVRecord;
+
+public class CsvExternalSort {
+
+ private static final Logger LOG = Logger.getLogger(CsvExternalSort.class.getName());
+
+ private CsvExternalSort() {
+ throw new UnsupportedOperationException("Unable to instantiate utility class");
+ }
+
+ /**
+ * This method calls the garbage collector and then returns the free memory.
+ * This avoids problems with applications where the GC hasn't reclaimed memory
+ * and reports no available memory.
+ *
+ * @return available memory
+ */
+ public static long estimateAvailableMemory() {
+ System.gc();
+ return Runtime.getRuntime().freeMemory();
+ }
+
+ /**
+ * we divide the file into small blocks. If the blocks are too small, we shall
+ * create too many temporary files. If they are too big, we shall be using too
+ * much memory.
+ *
+ * @param sizeoffile how much data (in bytes) can we expect
+ * @param maxtmpfiles how many temporary files can we create (e.g., 1024)
+ * @param maxMemory Maximum memory to use (in bytes)
+ * @return the estimate
+ */
+ public static long estimateBestSizeOfBlocks(final long sizeoffile, final int maxtmpfiles, final long maxMemory) {
+ // we don't want to open up much more than maxtmpfiles temporary
+ // files, better run
+ // out of memory first.
+ long blocksize = sizeoffile / maxtmpfiles + (sizeoffile % maxtmpfiles == 0 ? 0 : 1);
+
+ // on the other hand, we don't want to create many temporary
+ // files
+ // for naught. If blocksize is smaller than half the free
+ // memory, grow it.
+ if (blocksize < maxMemory / 6) {
+ blocksize = maxMemory / 6;
+ }
+ return blocksize;
+ }
+
+ public static int mergeSortedFiles(BufferedWriter fbw, final CsvSortOptions sortOptions, List<CSVRecordBuffer> bfbs, List<CSVRecord> header)
+ throws IOException, ClassNotFoundException {
+ PriorityQueue<CSVRecordBuffer> pq = new PriorityQueue<CSVRecordBuffer>(11, new Comparator<CSVRecordBuffer>() {
+ @Override
+ public int compare(CSVRecordBuffer i, CSVRecordBuffer j) {
+ return sortOptions.getComparator().compare(i.peek(), j.peek());
+ }
+ });
+ for (CSVRecordBuffer bfb : bfbs)
+ if (!bfb.empty())
+ pq.add(bfb);
+ int rowcounter = 0;
+ CSVPrinter printer = new CSVPrinter(fbw, sortOptions.getFormat());
+ if(! sortOptions.isSkipHeader()) {
+ for(CSVRecord r: header) {
+ printer.printRecord(r);
+ }
+ }
+ CSVRecord lastLine = null;
+ try {
+ while (pq.size() > 0) {
+ CSVRecordBuffer bfb = pq.poll();
+ CSVRecord r = bfb.pop();
+ // Skip duplicate lines
+ if (sortOptions.isDistinct() && checkDuplicateLine(r, lastLine)) {
+ } else {
+ printer.printRecord(r);
+ lastLine = r;
+ }
+ ++rowcounter;
+ if (bfb.empty()) {
+ bfb.close();
+ } else {
+ pq.add(bfb); // add it back
+ }
+ }
+ } finally {
+ printer.close();
+ fbw.close();
+ for (CSVRecordBuffer bfb : pq)
+ bfb.close();
+ }
+
+ return rowcounter;
+ }
+
+ public static int mergeSortedFiles(List<File> files, File outputfile, final CsvSortOptions sortOptions,
+ boolean append, List<CSVRecord> header) throws IOException, ClassNotFoundException {
+
+ List<CSVRecordBuffer> bfbs = new ArrayList<CSVRecordBuffer>();
+ for (File f : files) {
+ InputStream in = new FileInputStream(f);
+ BufferedReader fbr = new BufferedReader(new InputStreamReader(in, sortOptions.getCharset()));
+ CSVParser parser = new CSVParser(fbr, sortOptions.getFormat());
+ CSVRecordBuffer bfb = new CSVRecordBuffer(parser);
+ bfbs.add(bfb);
+ }
+
+ BufferedWriter fbw = new BufferedWriter(
+ new OutputStreamWriter(new FileOutputStream(outputfile, append), sortOptions.getCharset()));
+
+ int rowcounter = mergeSortedFiles(fbw, sortOptions, bfbs, header);
+ for (File f : files) {
+ if (!f.delete()) {
+ LOG.log(Level.WARNING, String.format("The file %s was not deleted", f.getName()));
+ }
+ }
+
+ return rowcounter;
+ }
+
+ public static List<File> sortInBatch(long size_in_byte, final BufferedReader fbr, final File tmpdirectory,
+ final CsvSortOptions sortOptions, List<CSVRecord> header) throws IOException {
+
+ List<File> files = new ArrayList<File>();
+ long blocksize = estimateBestSizeOfBlocks(size_in_byte, sortOptions.getMaxTmpFiles(),
+ sortOptions.getMaxMemory());// in
+ // bytes
+ AtomicLong currentBlock = new AtomicLong(0);
+ List<CSVRecord> tmplist = new ArrayList<CSVRecord>();
+
+ try (CSVParser parser = new CSVParser(fbr, sortOptions.getFormat())) {
+ parser.spliterator().forEachRemaining(e -> {
+ if (e.getRecordNumber() <= sortOptions.getNumHeader()) {
+ header.add(e);
+ } else {
+ tmplist.add(e);
+ currentBlock.addAndGet(SizeEstimator.estimatedSizeOf(e));
+ }
+ if (currentBlock.get() >= blocksize) {
+ try {
+ files.add(sortAndSave(tmplist, tmpdirectory, sortOptions));
+ } catch (IOException e1) {
+ LOG.log(Level.WARNING, String.format("Error during the sort in batch"), e1);
+ }
+ tmplist.clear();
+ currentBlock.getAndSet(0);
+ }
+ });
+ }
+ if (!tmplist.isEmpty()) {
+ files.add(sortAndSave(tmplist, tmpdirectory, sortOptions));
+ }
+
+ return files;
+ }
+
+ public static File sortAndSave(List<CSVRecord> tmplist, File tmpdirectory, final CsvSortOptions sortOptions) throws IOException {
+ Collections.sort(tmplist, sortOptions.getComparator());
+ File newtmpfile = File.createTempFile("sortInBatch", "flatfile", tmpdirectory);
+ newtmpfile.deleteOnExit();
+
+ CSVRecord lastLine = null;
+ try (Writer writer = new OutputStreamWriter(new FileOutputStream(newtmpfile), sortOptions.getCharset());
+ CSVPrinter printer = new CSVPrinter(new BufferedWriter(writer), sortOptions.getFormat());) {
+ for (CSVRecord r : tmplist) {
+ // Skip duplicate lines
+ if (sortOptions.isDistinct() && checkDuplicateLine(r, lastLine)) {
+ } else {
+ printer.printRecord(r);
+ lastLine = r;
+ }
+ }
+ }
+
+ return newtmpfile;
+ }
+
+ private static boolean checkDuplicateLine(CSVRecord currentLine, CSVRecord lastLine) {
+ if (lastLine == null || currentLine == null) {
+ return false;
+ }
+
+ for (int i = 0; i < currentLine.size(); i++) {
+ if (!currentLine.get(i).equals(lastLine.get(i))) {
+ return false;
+ }
+ }
+ return true;
+ }
+
+ public static List<File> sortInBatch(File file, File tmpdirectory, final CsvSortOptions sortOptions, List<CSVRecord> header)
+ throws IOException {
+ try (BufferedReader fbr = new BufferedReader(
+ new InputStreamReader(new FileInputStream(file), sortOptions.getCharset()))) {
+ return sortInBatch(file.length(), fbr, tmpdirectory, sortOptions, header);
+ }
+ }
+
+ /**
+ * Default maximal number of temporary files allowed.
+ */
+ public static final int DEFAULTMAXTEMPFILES = 1024;
+
+}
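
Note on estimateBestSizeOfBlocks: a small worked example may help when reading the block-size logic above. The sketch copies only the arithmetic from the diff into a hypothetical class, BlockSizeSketch, which is not part of the library; the input values are made up for illustration. The in-code comment mentions "half the free memory", while the code as committed uses one sixth of maxMemory as the floor, and the sketch follows the code.

```java
// Hypothetical stand-alone copy of the arithmetic, for illustration only.
final class BlockSizeSketch {

    static long estimateBestSizeOfBlocks(long sizeOfFile, int maxTmpFiles, long maxMemory) {
        // Ceiling division: the smallest block size that needs at most maxTmpFiles blocks.
        long blockSize = sizeOfFile / maxTmpFiles + (sizeOfFile % maxTmpFiles == 0 ? 0 : 1);
        // Floor: never go below one sixth of the memory budget.
        if (blockSize < maxMemory / 6) {
            blockSize = maxMemory / 6;
        }
        return blockSize;
    }

    public static void main(String[] args) {
        // Small file: the ceiling division gives 977, so the memory floor (100000) applies.
        System.out.println(estimateBestSizeOfBlocks(1_000_000L, 1024, 600_000L));
        // Large file: the ceiling division gives 976563, which exceeds the floor.
        System.out.println(estimateBestSizeOfBlocks(1_000_000_000L, 1024, 600_000L));
    }
}
```

In the second case the returned block size is larger than maxMemory / 6, which matches the comment in the source about preferring to run out of memory over opening more than maxtmpfiles temporary files.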
=====================================
src/main/java/com/google/code/externalsorting/csv/CsvSortOptions.java
=====================================
@@ -0,0 +1,116 @@
+package com.google.code.externalsorting.csv;
+
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVRecord;
+
+import java.nio.charset.Charset;
+import java.util.Comparator;
+
+/**
+ * Parameters for csv sorting
+ */
+public class CsvSortOptions {
+ private final Comparator<CSVRecord> comparator;
+ private final int maxTmpFiles;
+ private final long maxMemory;
+ private final Charset charset;
+
+ private final boolean distinct;
+ private final int numHeader; //number of header row in input file
+ private final boolean skipHeader; //print header or not to output file
+ private final CSVFormat format;
+
+ public Comparator<CSVRecord> getComparator() {
+ return comparator;
+ }
+
+ public int getMaxTmpFiles() {
+ return maxTmpFiles;
+ }
+
+ public long getMaxMemory() {
+ return maxMemory;
+ }
+
+ public Charset getCharset() {
+ return charset;
+ }
+
+ public boolean isDistinct() {
+ return distinct;
+ }
+
+ public int getNumHeader() {
+ return numHeader;
+ }
+
+ public boolean isSkipHeader() {
+ return skipHeader;
+ }
+
+ public CSVFormat getFormat() {
+ return format;
+ }
+
+ public static class Builder {
+ //mandatory params
+ private final Comparator<CSVRecord> cmp;
+ private final int maxTmpFiles;
+ private final long maxMemory;
+
+ //optional params with default values
+ private Charset cs = Charset.defaultCharset();
+ private boolean distinct = false;
+ private int numHeader = 0;
+ private boolean skipHeader = true;
+ private CSVFormat format = CSVFormat.DEFAULT;
+
+ public Builder(Comparator<CSVRecord> cmp, int maxTmpFiles, long maxMemory) {
+ this.cmp = cmp;
+ this.maxTmpFiles = maxTmpFiles;
+ this.maxMemory = maxMemory;
+ }
+
+ public Builder charset(Charset value){
+ cs = value;
+ return this;
+ }
+
+ public Builder distinct(boolean value){
+ distinct = value;
+ return this;
+ }
+
+ public Builder numHeader(int value){
+ numHeader = value;
+ return this;
+ }
+
+ public Builder skipHeader(boolean value){
+ skipHeader = value;
+ return this;
+ }
+
+ public Builder format(CSVFormat value){
+ format = value;
+ return this;
+ }
+
+
+ public CsvSortOptions build(){
+ return new CsvSortOptions(this);
+ }
+ }
+
+ private CsvSortOptions(Builder builder){
+ this.comparator = builder.cmp;
+ this.maxTmpFiles = builder.maxTmpFiles;
+ this.maxMemory = builder.maxMemory;
+ this.charset = builder.cs;
+ this.distinct = builder.distinct;
+ this.numHeader = builder.numHeader;
+ this.skipHeader = builder.skipHeader;
+ this.format = builder.format;
+ }
+
+}
=====================================
src/main/java/com/google/code/externalsorting/csv/SizeEstimator.java
=====================================
@@ -0,0 +1,56 @@
+package com.google.code.externalsorting.csv;
+
+public final class SizeEstimator {
+
+ private static int OBJ_HEADER;
+ private static int ARR_HEADER;
+ private static int INT_FIELDS = 12;
+ private static int OBJ_REF;
+ private static int OBJ_OVERHEAD;
+ private static boolean IS_64_BIT_JVM;
+
+ private SizeEstimator() {
+
+ }
+
+ /**
+ * Class initializations.
+ */
+ static {
+ // By default we assume 64 bit JVM
+ // (defensive approach since we will get
+ // larger estimations in case we are not sure)
+ IS_64_BIT_JVM = true;
+ // check the system property "sun.arch.data.model"
+ // not very safe, as it might not work for all JVM implementations
+ // nevertheless the worst thing that might happen is that the JVM is 32bit
+ // but we assume its 64bit, so we will be counting a few extra bytes per string object
+ // no harm done here since this is just an approximation.
+ String arch = System.getProperty("sun.arch.data.model");
+ if (arch != null) {
+ if (arch.indexOf("32") != -1) {
+ // If exists and is 32 bit then we assume a 32bit JVM
+ IS_64_BIT_JVM = false;
+ }
+ }
+ // The sizes below are a bit rough as we don't take into account
+ // advanced JVM options such as compressed oops
+ // however if our calculation is not accurate it'll be a bit over
+ // so there is no danger of an out of memory error because of this.
+ OBJ_HEADER = IS_64_BIT_JVM ? 16 : 8;
+ ARR_HEADER = IS_64_BIT_JVM ? 24 : 12;
+ OBJ_REF = IS_64_BIT_JVM ? 8 : 4;
+ OBJ_OVERHEAD = OBJ_HEADER + INT_FIELDS + OBJ_REF + ARR_HEADER;
+
+ }
+
+ /**
+ * Estimates the size of a object in bytes.
+ *
+ * @param s The string to estimate memory footprint.
+ * @return The <strong>estimated</strong> size in bytes.
+ */
+ public static long estimatedSizeOf(Object s) {
+ return ((long) (s.toString().length() * 2) + OBJ_OVERHEAD);
+ }
+}
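
Note on SizeEstimator: the constants above add up to a fixed per-object overhead of 16 + 12 + 8 + 24 = 60 bytes when a 64-bit JVM is assumed and 8 + 12 + 4 + 12 = 36 bytes otherwise, plus two bytes per character of the object's toString() form. The sketch below re-derives those numbers in a hypothetical class, SizeEstimateSketch, which is not part of the library; the is64Bit flag is passed explicitly here instead of being read from the "sun.arch.data.model" system property.

```java
// Hypothetical re-derivation of the constants in the diff, for illustration only.
final class SizeEstimateSketch {

    static long overhead(boolean is64Bit) {
        int objHeader = is64Bit ? 16 : 8;
        int arrHeader = is64Bit ? 24 : 12;
        int intFields = 12;
        int objRef = is64Bit ? 8 : 4;
        return objHeader + intFields + objRef + arrHeader;
    }

    // Two bytes per character of the string form, plus the fixed overhead.
    static long estimatedSizeOf(Object o, boolean is64Bit) {
        return (long) o.toString().length() * 2 + overhead(is64Bit);
    }

    public static void main(String[] args) {
        String line = "Val1,Data2,Data3,Data4"; // 22 characters
        System.out.println(estimatedSizeOf(line, true));  // 22 * 2 + 60
        System.out.println(estimatedSizeOf(line, false)); // 22 * 2 + 36
    }
}
```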
=====================================
src/test/java/com/google/code/externalsorting/ExternalSortTest.java
=====================================
@@ -4,24 +4,17 @@ import static org.junit.Assert.assertArrayEquals;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.assertFalse;
-import java.io.BufferedReader;
-import java.io.File;
-import java.io.FileInputStream;
-import java.io.FileOutputStream;
-import java.io.FileReader;
-import java.io.IOException;
+import java.io.*;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Comparator;
-import java.util.List;
-import java.util.Scanner;
+import java.util.*;
+import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
@@ -458,4 +451,79 @@ public class ExternalSortTest {
return path;
}
+ /**
+ * Sort with a custom comparator.
+ * @throws IOException
+ */
+ @Test
+ public void sortWithCustomComparator() throws IOException {
+ Random rand = new Random();
+ final Path path = Files.createTempFile("TestCsvWithLongIds", ".csv");
+ final Path pathSorted = Files.createTempFile("TestCsvWithLongIdsSorted", ".csv");
+ Set<Long> sortedIds = new TreeSet<>();
+ try (FileWriter fw = new FileWriter(path.toFile());
+ BufferedWriter bw = new BufferedWriter(fw)) {
+ for (int i = 0; i < 1000; ++i) {
+ long nextLong = rand.nextLong();
+ sortedIds.add(nextLong);
+ bw.write(String.format("%d,%s\n", nextLong, UUID.randomUUID().toString()));
+ }
+ }
+ AtomicBoolean wasCalled = new AtomicBoolean(false);
+ ExternalSort.sort(path.toFile(), pathSorted.toFile(), (lhs, rhs) -> {
+ Long lhsLong = lhs.indexOf(',') == -1 ? Long.MAX_VALUE : Long.parseLong(lhs.split(",")[0]);
+ Long rhsLong = rhs.indexOf(',') == -1 ? Long.MAX_VALUE : Long.parseLong(rhs.split(",")[0]);
+ wasCalled.set(true);
+ return lhsLong.compareTo(rhsLong);
+ });
+ assertTrue("The custom comparator was not called!", wasCalled.get());
+ Iterator<Long> idIter = sortedIds.iterator();
+ try (FileReader fr = new FileReader(pathSorted.toFile());
+ BufferedReader bw = new BufferedReader(fr)) {
+ String nextLine = bw.readLine();
+ Long lhsLong = nextLine.indexOf(',') == -1 ? Long.MAX_VALUE : Long.parseLong(nextLine.split(",")[0]);
+ Long nextId = idIter.next();
+ assertEquals(lhsLong, nextId);
+ }
+ }
+
+ @Test
+ public void lowMaxMemory() throws IOException {
+ String unsortedContent =
+ "Val1,Data2,Data3,Data4\r\n" +
+ "Val2,Data2,Data4,Data5\r\n" +
+ "Val1,Data2,Data3,Data5\r\n" +
+ "Val2,Data2,Data6,Data7\r\n";
+ InputStream bis = new ByteArrayInputStream(unsortedContent.getBytes(StandardCharsets.UTF_8));
+ File tmpDirectory = Files.createTempDirectory("sort").toFile();
+ tmpDirectory.deleteOnExit();
+
+ BufferedReader inputReader = new BufferedReader(new InputStreamReader(bis, StandardCharsets.UTF_8));
+ List<File> tmpSortedFiles = ExternalSort.sortInBatch(
+ inputReader,
+ unsortedContent.length(),
+ ExternalSort.defaultcomparator,
+ Integer.MAX_VALUE, // use an unlimited number of temp files
+ 100, // max memory
+ StandardCharsets.UTF_8,
+ tmpDirectory,
+ false, // no distinct
+ 0, // no header lines to skip
+ false, // don't use gzip
+ true); // parallel
+ File tmpOutputFile = File.createTempFile("merged", "", tmpDirectory);
+ tmpOutputFile.deleteOnExit();
+ BufferedWriter outputWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream((tmpOutputFile))));
+ ExternalSort.mergeSortedFiles(
+ tmpSortedFiles,
+ outputWriter,
+ ExternalSort.defaultcomparator,
+ StandardCharsets.UTF_8,
+ false, // no distinct
+ false); // don't use gzip
+
+ for (File tmpSortedFile: tmpSortedFiles) {
+ assertFalse(tmpSortedFile.exists());
+ }
+ }
}
=====================================
src/test/java/com/google/code/externalsorting/csv/CsvExternalSortTest.java
=====================================
@@ -0,0 +1,222 @@
+package com.google.code.externalsorting.csv;
+
+import org.apache.commons.csv.CSVFormat;
+import org.apache.commons.csv.CSVRecord;
+import org.junit.After;
+import org.junit.Test;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.lang.reflect.Field;
+import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.List;
+import java.util.ArrayList;
+import java.util.Map;
+
+import static org.junit.Assert.assertEquals;
+
+
+public class CsvExternalSortTest {
+ private static final String FILE_CSV = "externalSorting.csv";
+ private static final String FILE_UNICODE_CSV = "nonLatinSorting.csv";
+
+ private static final String FILE_CSV_WITH_TABS = "externalSortingTabs.csv";
+ private static final String FILE_CSV_WITH_SEMICOOLONS = "externalSortingSemicolon.csv";
+ private static final char SEMICOLON = ';';
+
+ File outputfile;
+
+ @Test
+ public void testMultiLineFile() throws IOException, ClassNotFoundException {
+ String path = this.getClass().getClassLoader().getResource(FILE_CSV).getPath();
+
+ File file = new File(path);
+
+ outputfile = new File("outputSort1.csv");
+
+ Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0)
+ .compareTo(op2.get(0));
+
+ CsvSortOptions sortOptions = new CsvSortOptions
+ .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
+ .charset(Charset.defaultCharset())
+ .distinct(false)
+ .numHeader(1)
+ .skipHeader(true)
+ .format(CSVFormat.DEFAULT)
+ .build();
+ ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();
+
+ List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
+
+ assertEquals(1, sortInBatch.size());
+
+ int mergeSortedFiles = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);
+
+ assertEquals(4, mergeSortedFiles);
+
+ BufferedReader reader = new BufferedReader(new FileReader(outputfile));
+ String readLine = reader.readLine();
+
+ assertEquals("6,this wont work in other systems,3", readLine);
+ reader.close();
+ }
+
+ @Test
+ public void testNonLatin() throws Exception {
+ Field cs = Charset.class.getDeclaredField("defaultCharset");
+ cs.setAccessible(true);
+ cs.set(null, Charset.forName("windows-1251"));
+
+ String path = this.getClass().getClassLoader().getResource(FILE_UNICODE_CSV).getPath();
+
+ File file = new File(path);
+
+ outputfile = new File("unicode_output.csv");
+
+ Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0)
+ .compareTo(op2.get(0));
+
+ CsvSortOptions sortOptions = new CsvSortOptions
+ .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
+ .charset(StandardCharsets.UTF_8)
+ .distinct(false)
+ .numHeader(1)
+ .skipHeader(true)
+ .format(CSVFormat.DEFAULT)
+ .build();
+ ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();
+ List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
+
+ assertEquals(1, sortInBatch.size());
+
+ int mergeSortedFiles = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);
+
+ assertEquals(5, mergeSortedFiles);
+
+ List<String> lines = Files.readAllLines(Paths.get(outputfile.getPath()), StandardCharsets.UTF_8);
+
+ assertEquals("2,זה רק טקסט אחי לקריאה קשה,8", lines.get(0));
+ assertEquals("5,هذا هو النص إخوانه فقط من الصعب القراءة,3", lines.get(1));
+ assertEquals("6,это не будет работать в других системах,3", lines.get(2));
+ }
+
+
+ @Test
+ public void testCVSFormat() throws Exception {
+ Map<CSVFormat, Pair> map = new HashMap<CSVFormat, Pair>(){{
+ put(CSVFormat.MYSQL, new Pair(FILE_CSV_WITH_TABS, "6 \"this wont work in other systems\" 3"));
+ put(CSVFormat.EXCEL.withDelimiter(SEMICOLON), new Pair(FILE_CSV_WITH_SEMICOOLONS, "6;this wont work in other systems;3"));
+ }};
+
+ for (Map.Entry<CSVFormat, Pair> format : map.entrySet()){
+ String path = this.getClass().getClassLoader().getResource(format.getValue().getFileName()).getPath();
+
+ File file = new File(path);
+
+ outputfile = new File("outputSort1.csv");
+
+ Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0)
+ .compareTo(op2.get(0));
+
+ CsvSortOptions sortOptions = new CsvSortOptions
+ .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
+ .charset(Charset.defaultCharset())
+ .distinct(false)
+ .numHeader(1)
+ .skipHeader(true)
+ .format(format.getKey())
+ .build();
+ ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();
+ List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
+
+ assertEquals(1, sortInBatch.size());
+
+ int mergeSortedFiles = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, false, header);
+
+ assertEquals(4, mergeSortedFiles);
+
+ List<String> lines = Files.readAllLines(outputfile.toPath());
+
+ assertEquals(format.getValue().getExpected(), lines.get(0));
+ assertEquals(4, lines.size());
+ }
+ }
+
+ @Test
+ public void testMultiLineFileWthHeader() throws IOException, ClassNotFoundException {
+ String path = this.getClass().getClassLoader().getResource(FILE_CSV).getPath();
+
+ File file = new File(path);
+
+ outputfile = new File("outputSort1.csv");
+
+ Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0)
+ .compareTo(op2.get(0));
+
+ CsvSortOptions sortOptions = new CsvSortOptions
+ .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
+ .charset(Charset.defaultCharset())
+ .distinct(false)
+ .numHeader(1)
+ .skipHeader(false)
+ .format(CSVFormat.DEFAULT)
+ .build();
+ ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();
+ List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
+
+ assertEquals(1, sortInBatch.size());
+
+ int mergeSortedFiles = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);
+
+ List<String> lines = Files.readAllLines(outputfile.toPath(), sortOptions.getCharset());
+
+ assertEquals("personId,text,ishired", lines.get(0));
+ assertEquals("6,this wont work in other systems,3", lines.get(1));
+ assertEquals("6,this wont work in other systems,3", lines.get(2));
+ assertEquals("7,My Broken Text will break you all,1", lines.get(3));
+ assertEquals("8,this is only bro text for hard read,2", lines.get(4));
+ assertEquals(5, lines.size());
+
+ }
+
+ @After
+ public void onTearDown() {
+ if(outputfile.exists()) {
+ outputfile.delete();
+ }
+ }
+
+ private class Pair {
+ private String fileName;
+ private String expected;
+
+ public Pair(String fileName, String expected) {
+ this.fileName = fileName;
+ this.expected = expected;
+ }
+
+ public String getFileName() {
+ return fileName;
+ }
+
+ public void setFileName(String fileName) {
+ this.fileName = fileName;
+ }
+
+ public String getExpected() {
+ return expected;
+ }
+
+ public void setExpected(String expected) {
+ this.expected = expected;
+ }
+ }
+}
=====================================
src/test/resources/externalSorting.csv
=====================================
@@ -0,0 +1,5 @@
+personId,text,ishired
+7,"My Broken Text will break you all",1
+8,"this is only bro text for hard read",2
+6,"this wont work in other systems",3
+6,"this wont work in other systems",3
\ No newline at end of file
=====================================
src/test/resources/externalSortingSemicolon.csv
=====================================
@@ -0,0 +1,5 @@
+personId;text;ishired
+7;"My Broken Text will break you all";1
+8;"this is only bro text for hard read";2
+6;"this wont work in other systems";3
+6;"this wont work in other systems";3
\ No newline at end of file
=====================================
src/test/resources/externalSortingTabs.csv
=====================================
@@ -0,0 +1,5 @@
+personId text ishired
+7 "My Broken Text will break you all" 1
+8 "this is only bro text for hard read" 2
+6 "this wont work in other systems" 3
+6 "this wont work in other systems" 3
\ No newline at end of file
=====================================
src/test/resources/nonLatinSorting.csv
=====================================
@@ -0,0 +1,6 @@
+personId,text,ishired
+7,"My Broken Text will break you all",1
+2,"זה רק טקסט אחי לקריאה קשה",8
+6,"это не будет работать в других системах",3
+6,"это не будет работать в других системах",3
+5,"هذا هو النص إخوانه فقط من الصعب القراءة",3
\ No newline at end of file
View it on GitLab: https://salsa.debian.org/java-team/libexternalsortinginjava-java/-/commit/5ef8e291c6195a2014c5978acccf53790c030427