Thursday, April 13, 2006

Using Wikipedia as a natural language corpus

1) download the english yahoo abstracts from

http://en.wikipedia.org/wiki/Sort-Merge_Join

wiki.txt
Anarchism is derived from the Greek αναÏÏία ("without archons (ruler, chief, king)"). Anarchism as a political philosophy, is the belief that rulers, governments, and hierarchal social relationships are unnecessary and should be abolished, although there are differing interpretations of what this means.
|
Albedo is the measure of reflectivity of a surface or body. It is the ratio of electromagnetic radiation (EM radiation) reflected to the amount incident upon it.
Abu Dhabi (Arabic: أب٠ظب٠ʼAbÅ« áºaby) is the largest of the seven emirates that comprise the United Arab Emirates and was also the largest of the former Trucial States. Abu Dhabi is also a city of the same name within the Emirate that is the capital of the country, in north central UAE.


wiki.txt1
Anarchism
is
derived
from
the

wiki.txt2
is
derived
from
the
Greek

wiki.txt12
Anarchism is
is derived
derived from
from the
the Greek

wiki.txt.sorted
wiki.txt.sorted.1
wiki.txt.sorted.nonums
wiki.txt.sorted.nonums1
wiki.txt.sorted.nonums1.uniq
wiki.txt.sorted.nonums1.uniq.2words

export LC_ALL=C

cat
wordnet.index
wordnet.index.sorted
wordnet.index.sorted.uniq
wiki.txt.sorted.nonums1.uniq.2words.cols
wiki.txt.sorted.nonums1.uniq.2words.cols.LC_ALL
wordnet.index.sorted.uniq.LC_ALL
wiki-wordnet.join


for k in `cat wordnet.index.sorted.uniq.LC_ALL`; do look $k wiki.txt.sorted.nonums1.uniq.2words.cols.LC_ALL; done >/tar-backups/wiki-wordnet.join



3 word n-grams
grep -v '[^[:print:]]' wiki.txt1 >wiki1.txt
tail +2 wiki1.txt >wiki2.txt
tail +2 wiki2.txt >wiki3.txt

grep "[[:print:]]\+\t[[:print:]]\+\t[[:print:]]" wiki123.txt >wiki123.txt3



find the most used 3 word phrases in a text file
sort rewrite123.txt3 >rewrite123.txt3
grep ".\+\t.\+\t.\+" rewrite123.txt >rewrite123.txt3
sort rewrite123.txt3 >rewrite123.txt3.sorted
uniq -dc rewrite123.txt3.sorted >rewrite123.txt3.sorted.uniqd
sort -nr rewrite123.txt3.sorted.uniqd >rewrite123.txt3.sorted.uniqd.sorted
more rewrite123.txt3.sorted.uniqd.sorted

3 be traveling with
2 you want to
2 want to have
2 student group bookings.
2 part of their
2 on the other
2 notarized letter of
2 made with tap
2 entertainment , activities
2 and want to