day 16 ] PreProcessing => Lexicon까지 정리 및 실습

2019. 5. 22. 22:25

1. retrieval 정리

1) Indexer
- Crawler ( Focuesd ) => Repository( Collection이라고 부름 )
- Document Analyzer => HTML, Tokenizing, Normalizing, Stemming( BPE ), N-Gram, MA, Pos, Stopwords, RE, Pharases
    => PreProcessing
- Features => Lexicon
- Document (Query) Representation => Back Of Word 모델
   Document-Term Mat., Term-Dcoument Mat.(핵심)
   => Inverted Document Indexing(역문헌구조)
2) Relevance(Ranking)
- Weighting(TF-IDF), Similarity(Cosine: 0-1, Euclidean)
- Sorting
3) Results
- 끝 (Top K)

2. 실습

1) Lexicon 만들기

- 전체 문서에서 단어의 집합(중복 X)

# Lexicon 만들기
from konlpy.corpus import kobill

def getLexicon():
    lexicon = list()
    for document in [kobill.open(idx).read() for idx in kobill.fileids()]:
        # split
        for term in document.split():
                lexicon.append(term)

        return list(set(lexicon))

2) Bow 구하기

- 배열 형태로 어떤 문서에 어떤 단어가 몇번씩 나왔는 지

#       1  2  3  ...  2638
# doc1  1  0  1  ...    1
# doc2  0  1  1  ...    0

# BoW 가져오기
def documentRepresentation1():
    documentList = list()
    for document in [kobill.open(idx).read() for idx in kobill.fileids()]:
        # 차트 초기화
        bow = list(_ for _ in range(len(lexicon)))
        for term in document.split():
            bow[lexicon.index(term)] = 1
        documentList.append(bow)
    return documentList

3) BoW 심화

- 각 단어에 매칭시켜서 필요한 것만 만들도록 딕셔너리를 사용

# w1 w2 w3 ... w2638
# doc1 1 0 1 ... 1
# doc2 0 1 1 ... 0

# BoW 가져오기
def documentRepresentation2():
    documentList = list()
    for document in [kobill.open(idx).read() for idx in kobill.fileids()]:
        # 차트 초기화
        bow = dict()
        for term in document.split():
            bow[term] = 1
        documentList.append(bow)
    return documentList

4) 빈도로 표현

# BoW 단점
# 0과 1만으로 표현하기에는 각 데이터들의 중요도를 확인하기 어렵다. => 나오는 빈도도 함께 표현한다 (+1 으로 점차 늘리면서)
from collections import defaultdict # 키가 없을 때 에러가 없도록 하기 위함
def documentRepresentation3():
    documentList = list()
    for document in [kobill.open(idx).read() for idx in kobill.fileids()]:
        # 차트 초기화
        bow = defaultdict(int)
        for term in document.split():
            bow[term] += 1
        documentList.append(bow)
    return documentList

5) 각 문서 정보 추가

# 제목이 매칭되지 않았기 때문에 어떤 문서인지 알 수 없기 때문에 문서 제목을 저장하는 형태로 변경
def documentRepresentation4():
    documentList = defaultdict(lambda: defaultdict(int))
    for idx in kobill.fileids():
        for term in kobill.open(idx).read().split():
            documentList[idx][term] += 1
    return documentList

6) 검색 해보기

# Boolean 검색 => 집합론
#각 단어를 각 문서에서 찾음
query = "국회 의원 국민"

def booleanResult1():
    result = list()
    for term in query.split():
        searchResult = list()
        for idx, termList in docList.items():
            if term in termList.keys():
                searchResult.append(idx)
        result.append(searchResult)

    #[['1809897.txt', '1809898.txt'],
    # ['1809896.txt'],
    #['1809896.txt', '1809897.txt']]

    one = result.pop() # 최초 한개 뽑아오고 시작
    while result: # 리스트를 하나씩 빼와서 다 비교할 때까지 조사
        temp = result.pop()
        one = list(set(one).intersection(temp)) # 교집합을 찾음
    return one

7) DMT => TDM

- 단어를 기반으로 검색하기 때문에 단어를 key로 가지면 바로 검색 가능

- 따라서 TDM 형태로 만들고 key가 되는 단어를 따로 변수로 가지고 있고 문서를 DB형태(Posting)으로 가짐

- 필요하면 hash에서 찾고 해당 문서를 뒤져서 가져오면 됨

# 키값이 되는 단어는 hash로 넣어두고 어느 문서에 저장되어 있는지는 DB(Posting)에 저장해 둠(파이썬에서는 포인터가 없으니 인덱스를 사용)

TDM = defaultdict(lambda: defaultdict(int))
for idx, termList in docList.items():
    for term, freq in termList.items():
        TDM[term][idx] = freq

#TDM : 효율을 위하여 행과 열을 바꿈, 키를 가지고 문서를 가져오기 때문에 훨씬 빠르게 처리할 수 있음
#       어떤 문서에 있는 지 다 보기가 힘듦(Key로 둘 것인가 Value로 둘 것인가의 차이)
def booleanResult2():
    result = list()
    for term in query.split():
        result.append(list(TDM[term].keys()))

    one = result.pop()
    while result:
        temp = result.pop()
        one = list(set(one).intersection(temp))
    return one

'데이터 분석가 역량' 카테고리의 다른 글

Day 17 ] TF-IDF (0)	2019.05.24
day 17 ] Document, Lexicon, Posting 구조 (0)	2019.05.24
day 16 ] 전체 과정 정리(막 정리함...) (0)	2019.05.22
day 15 ] 내 데이터로 출력해보기 (0)	2019.05.21
day 15 ] grammar (0)	2019.05.21

Stack Writing

day 16 ] PreProcessing => Lexicon까지 정리 및 실습

'데이터 분석가 역량' 카테고리의 다른 글

+ Recent posts

티스토리툴바