day 6 ] Learning parsing

2019. 5. 8. 22:32

1. Downloading images

1) Request

import requests

from bs4 import BeautifulSoup

url = "http://example.webscraping.com/"

resp = download(url)  # download() is a request helper that is not defined in this post

dom = BeautifulSoup(resp.content, "lxml")
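The `download` helper called above is not defined in this post. A minimal sketch of what such a helper might look like, assuming it wraps `requests` and returns the `Response` object. The `retries`, `delay` and `session` parameters and the retry-on-5xx behavior are assumptions, not the original code:

```python
import time

import requests


def download(url, params=None, retries=3, delay=1.0, session=None):
    """Hypothetical helper: GET `url` and return the requests Response.

    Retries on 5xx responses (an assumption); any remaining HTTP error
    is raised by raise_for_status().
    """
    session = session or requests.Session()
    for attempt in range(retries + 1):
        resp = session.get(url, params=params, timeout=10)
        # Stop on success or client error, or when retries are used up
        if resp.status_code < 500 or attempt == retries:
            break
        time.sleep(delay)  # brief pause before the next attempt
    resp.raise_for_status()
    return resp
```

With this shape, both `download(url)` and `download(url, params)` as used in this post would work unchanged.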

2) Getting the target elements

selectList = dom.select("a > img")

imgList = [requests.compat.urljoin(url, _['src']) for _ in selectList]
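`requests.compat.urljoin` is the standard library's `urllib.parse.urljoin` re-exported, so its resolution rules can be checked without any network call. A small illustration with made-up `src` values (the paths and host names below are hypothetical):

```python
from urllib.parse import urljoin

base = "http://example.webscraping.com/"

# Hypothetical src attribute values, one per common form
examples = {
    "relative": "places/static/images/flags/af.png",
    "root_relative": "/places/static/images/flags/af.png",
    "protocol_relative": "//cdn.example.com/flags/af.png",
    "absolute": "http://other.example.com/af.png",
}

# Resolve each src against the page URL, as the list comprehension does
resolved = {kind: urljoin(base, src) for kind, src in examples.items()}
for kind, full in resolved.items():
    print(f"{kind:18} {full}")
```

Relative and root-relative paths are resolved against the page URL, a protocol-relative value only borrows the scheme, and an absolute URL is returned as-is.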

3) Download

for src in imgList:
    print(src)
    html = download(src)
    # Save each image under the last path segment of its URL
    with open(src.split("/")[-1], "wb") as f:
        f.write(html.content)
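`src.split("/")[-1]` works for plain paths, but it keeps any query string in the file name and yields an empty name when the URL ends with a slash. One possible refinement, assuming a hypothetical helper named `filename_from_url` and an arbitrary fallback name of `download.bin`:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(src, default="download.bin"):
    """Hypothetical helper: derive a local file name from an image URL."""
    path = urlsplit(src).path                  # drops ?query and #fragment
    name = posixpath.basename(unquote(path))   # last segment, %-decoded
    return name or default                     # URL ending in "/" has no name
```

It could replace the `src.split("/")[-1]` expression inside `open(...)`.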

2. Checker

https://validator.w3.org/unicorn/check

3. Encoding and decoding

- Get the encoding type of the document and decode the content based on that type

1) Checking the type

html.encoding

2) Decoding

html.content.decode("euc-kr", "ignore").encode("utf-8").decode("utf-8")
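The trailing `.encode("utf-8").decode("utf-8")` is a round trip that returns the same string, so the first `decode` already does the work. A small sketch of the idea, using a hard-coded byte string in place of a real response. The helper name `decode_body` and its `fallback` parameter are assumptions:

```python
def decode_body(raw, declared=None, fallback="euc-kr"):
    """Decode raw bytes with the declared encoding, else a lenient fallback."""
    if declared:
        try:
            return raw.decode(declared)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong or unknown encoding name: fall through
    # errors="ignore" silently drops bytes that are invalid in `fallback`
    return raw.decode(fallback, "ignore")


raw = "한글 page".encode("euc-kr")          # stands in for html.content
text = decode_body(raw, declared="utf-8")   # declared encoding is wrong here
print(text)
```

Note that `"ignore"` hides decoding problems rather than reporting them, so it is worth checking the result by eye when a page looks garbled.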

4. Crawling Ppomppu

1) Building the DOM

url = "http://www.ppomppu.co.kr/zboard/zboard.php"

params = {"id":"ppomppu"}

html = download(url, params)

dom = BeautifulSoup(html.content, "lxml")

2) Extracting and printing

itemList = list()

for _ in dom.select("tr[class^='list']")[1:]:
    item = dict()
    # Direct child cells only, so cells of nested tables are not mixed in
    tdList = _.find_all("td", recursive=False)
    item['name'] = tdList[3].select_one("a > font").text.strip()
    item['img'] = requests.compat.urljoin("http:",tdList[3].select_one("img")['src'])
    item['url'] = requests.compat.urljoin(url, tdList[3].select_one("a")['href'])
    item['reg'] = tdList[4].text.strip()
    # The vote cell is split on '-' into up and down counts
    thumbs = tdList[5].text.strip().split('-')
    item['thumbsup'] = tdList[5].text.strip() if len(thumbs) < 2 else thumbs[0]
    item['thumbdown'] = tdList[5].text.strip() if len(thumbs) < 2 else thumbs[1]
    item['hit'] = tdList[6].text.strip()
    itemList.append(item)

itemList
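The two `thumbs` lines can be folded into one small function. A sketch, assuming the vote cell holds text such as `12 - 3` or is empty; the name `split_votes` is hypothetical, and unlike the loop it also strips the whitespace around each part:

```python
def split_votes(cell_text):
    """Return (up, down) strings from a vote cell's text."""
    text = cell_text.strip()
    parts = [p.strip() for p in text.split("-")]
    if len(parts) < 2:
        return text, text   # no '-' found: same fallback as the loop
    return parts[0], parts[1]
```

Inside the loop this might be used as `item['thumbsup'], item['thumbdown'] = split_votes(tdList[5].text)`.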

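3) Problems with html.parser

- For a malformed document, lxml can occasionally raise an error.

- html.parser can stand in for lxml in that case, but the tree it builds from a malformed document may differ from what you expect, so print the resulting dom, locate the elements in it directly, and rewrite the select statements accordingly.

- Prefer the latest version of lxml where possible, and use html.parser only for sites that lxml cannot handle.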
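One way to act on this advice is to try lxml first and fall back to html.parser only when needed. A sketch under that assumption: `select_with_fallback` is a hypothetical name, and "the selector returned no matches" is used here only as a stand-in for "this parser could not handle the page".

```python
from bs4 import BeautifulSoup, FeatureNotFound


def select_with_fallback(markup, selector, parsers=("lxml", "html.parser")):
    """Return (parser_name, matches) for the first parser that yields matches."""
    for name in parsers:
        try:
            dom = BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
        matches = dom.select(selector)
        if matches:
            return name, matches
    return None, []


sample = "<table><tr class='list0'><td>item</td></tr></table>"
parser_used, rows = select_with_fallback(sample, "tr[class^='list']")
print(parser_used, len(rows))
```

Because the two parsers can build different trees from the same broken markup, the returned parser name tells you which tree the selector was actually matched against.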
