I am Anik Paul Gomes. I am from Bangladesh and currently attending Northwest Missouri State University.
The objective of this project is to use Spark to count the words in a text document.
My data source is the text of a Wikipedia page about Bangladesh.
> val inputFile = sc.textFile("C:/44517/ConardSparkWordCount/AMSND.txt")
> val topWordCount = inputFile.
flatMap(str=>str.split(" ")).
filter(!_.isEmpty).
map(word=>(word,1)).
reduceByKey(_+_).
map{case (word, count) => (count, word)}.
sortByKey(false)
> topWordCount.take(20).foreach(x=>println(x))
| Frequency | Word |
|---|---|
| 992 | the |
| 600 | of |
| 577 | and |
| 458 | in |
| 242 | Bangladesh |
| 238 | The |
| 207 | a |
| 198 | to |
| 192 | is |
| 151 | was |
| 134 | by |
| 103 | with |
| 101 | are |
| 83 | Bengal |
| 80 | for |
| 76 | Bengali |
| 75 | has |
| 73 | as |
| 65 | Bangladeshi |
| 56 | from |
The most frequent word is “the”, with 992 occurrences. I was expecting it to be “Bangladesh”; however, that word is 5th in the list with 242 occurrences. Apart from that, my findings agree with my assumptions: Bangladesh (242 times), Bengal (83 times), Bengali (76 times), and Bangladeshi (65 times) are all in the top 20.
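
The table lists “the” and “The” as separate entries, and common function words fill most of the top ranks, because the pipeline splits on single spaces and compares words case-sensitively. The sketch below shows one possible way to normalize the words before counting. It is not part of the original project: it uses plain Scala collections instead of a Spark RDD so that it runs without a `SparkContext`, and the object name `NormalizedWordCount`, the stop-word list, and the sample sentences are all made up for illustration.

```scala
// A minimal sketch, assuming plain Scala collections as a local stand-in for
// the Spark RDD. All names and sample data here are hypothetical.
object NormalizedWordCount {
  // Hypothetical stop-word list, for illustration only.
  val stopWords: Set[String] = Set("the", "of", "and", "in", "a", "to", "is")

  def countWords(lines: Seq[String]): Seq[(Int, String)] =
    lines
      .flatMap(_.toLowerCase.split("\\W+").toSeq)          // lowercase, split on non-word characters
      .filter(w => w.nonEmpty && !stopWords.contains(w))   // drop empty tokens and stop words
      .groupBy(identity)                                   // local stand-in for reduceByKey(_ + _)
      .map { case (word, occurrences) => (occurrences.size, word) }
      .toSeq
      .sortBy { case (count, word) => (-count, word) }     // descending count, ties alphabetical

  def main(args: Array[String]): Unit = {
    // Made-up sample text, not the Wikipedia page used above.
    val sample = Seq(
      "The history of Bangladesh is the history of Bengal.",
      "Bangladesh is in South Asia."
    )
    countWords(sample).take(3).foreach(println)
  }
}
```

In the spark-shell pipeline, the same idea would presumably amount to lowercasing each line and splitting on a regular expression inside `flatMap`, then extending the `filter` step. Note that `\W` in Java regular expressions matches only non-ASCII-word characters by default, so words written in Bengali script would need a different pattern.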
