Open Source For Geeks: Remove repetitive words from a String and return it

Thursday 28 April 2016

Remove repetitive words from a String and return it

Question

Assume you are given a string, write a function that can remove consecutive repeated words in the string and return it. For example,

"Beginning with the first first first first manned Gemini mission in March 1965, commemorative commemorative medallions were prepared for the astronauts at their request. It is unclear who prepared these early medallions only only, only that each individual box containing a medallion bore the word Fliteline."

to

"Beginning with the first manned Gemini mission in March 1965, commemorative medallions were prepared for the astronauts at their request. It is unclear who prepared these early medallions only, only that each individual box containing a medallion bore the word Fliteline."

Make sure you preserving punctuations in the right positions.

Solution

I have written following Java code with various methods that solve variations of above question -

import java.util.HashSet;
import java.util.Set;
import java.util.Stack;
import java.util.StringTokenizer;

/**
 * 
 * @author athakur
 *
 */
public class RemoveDuplicates {

    public static void main(String args[]) {

        String input = "Beginning with the first first first first manned Gemini mission in March 1965, commemorative commemorative medallions were prepared for the astronauts at their request. It is unclear who prepared these early medallions only only, only that each individual box containing a medallion bore the word Fliteline.";

        System.out.println(removeConsecutiveDuplicates(input));
        System.out.println(removeConsecutiveDuplicatesBetter(input));
        System.out.println(removeAllDuplicates(input));

    }

    public static String removeAllDuplicates(String input) {
        StringBuilder outputBuilder = new StringBuilder();
        Set<String> wordsSet = new HashSet<>();
        StringTokenizer tokenizer = new StringTokenizer(input, " ");
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            if (wordsSet.add(word)) {
                outputBuilder.append(word);
                outputBuilder.append(" ");
            }
        }
        return outputBuilder.toString();
    }

    public static String removeConsecutiveDuplicatesBetter(String input) {

        StringBuilder outputBuilder = new StringBuilder();
        String lastWord = "";
        StringTokenizer tokenizer = new StringTokenizer(input, " ");
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            if (!word.equalsIgnoreCase(lastWord)) {
                outputBuilder.append(word);
                outputBuilder.append(" ");
            }
            lastWord = word;
        }
        return outputBuilder.toString();

    }

    public static String removeConsecutiveDuplicates(String input) {

        if (input == null)
            return null;

        Stack<String> wordsStack = new Stack<>();
        String[] words = input.split(" ");
        for (int i = 0; i < words.length; i++) {
            wordsStack.push(words[i]);
        }

        String lastWord = "";
        String stringWithNoConsecutiveDuplicates = "";

        while (!wordsStack.isEmpty()) {
            String tempPop = wordsStack.pop();
            if (!lastWord.equalsIgnoreCase(tempPop)) {
                stringWithNoConsecutiveDuplicates = tempPop + " "
                        + stringWithNoConsecutiveDuplicates;
            }

            lastWord = tempPop;
        }

        return stringWithNoConsecutiveDuplicates;

    }

}

Note

Couple of points to note. Prefer removeConsecutiveDuplicatesBetter approach as it does not keep all data in memory. So lets say you have a huge file you need to do this operation on you read it line by line and then tokenize it to further process it.

Also note HashSet has add method which returns false if data is already present in the set.

Stack method is more if you want to retain the duplicate consecutive words which are farthest in a line. Irrespective of your approach you need to take care of following -

Don't keep all the data in the memory (data can be very huge). Using Stack may led to overflow.
Use minimum space and more efficient lookup. Hashset internally used HashMap to keep the data hence put/get is ~O(1).
Avoid String concatenation. Use StringBuilder instead.

As you must have already noticed above approaches do not take care of punctuations. You will need separate processing for that.

Thursday 28 April 2016

Remove repetitive words from a String and return it

Question

Solution

Note

Related Links

No comments:

Post a Comment