day 4] BeautifulSoup

2019. 5. 3. 22:29

1. BeautifulSoup
- An element's .name gives its tag name, and .attrs gives its attributes (key-value pairs).
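A small sketch of .name and .attrs on a single tag. The HTML fragment and the variable names are made up for illustration, and html.parser is used here only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, used only for illustration.
sample = '<img id="gift1" class="thumb large" src="../img/gifts/img1.jpg">'
tag = BeautifulSoup(sample, "html.parser").find("img")

tag_name = tag.name      # the tag name as a string
tag_attrs = tag.attrs    # a dict mapping attribute names to values

print(tag_name)
print(tag_attrs)
print(tag["src"])        # shorthand for tag.attrs["src"]
```

Note that class comes back as a list of names, because one element can carry several classes.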
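1) Finding an object by walking up from the footer

import re
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://pythonscraping.com/pages/page3.html")
html = resp.text
soup = BeautifulSoup(html, 'lxml')
# Locate the footer, then step back to the sibling element just before it.
element = soup.find(id='footer')
table = element.find_previous_sibling()

# The dots are escaped so they match literally; each relative src is resolved against the page URL.
[requests.compat.urljoin(resp.request.url, _["src"]) for _ in table.find_all("img", {"src": re.compile(r"\.\./img/gifts/img[0-9]+\.jpg")})]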
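The same steps can be tried offline. This is only a sketch: the HTML below is a made-up miniature, not the real page, and it assumes the table sits directly before the footer. It passes "table" to find_previous_sibling to be explicit, and uses urllib.parse.urljoin from the standard library in place of requests.compat.urljoin.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up miniature of the layout: a table immediately followed by the footer.
sample = (
    '<div id="wrapper">'
    '<table id="giftList">'
    '<tr><td><img src="../img/gifts/img1.jpg"></td></tr>'
    '<tr><td><img src="../img/logo.jpg"></td></tr>'
    '</table>'
    '<div id="footer">footer text</div>'
    '</div>'
)
base_url = "http://pythonscraping.com/pages/page3.html"

soup = BeautifulSoup(sample, "html.parser")
footer = soup.find(id="footer")
table = footer.find_previous_sibling("table")   # nearest preceding <table> sibling

# Keep only src values that look like ../img/gifts/imgN.jpg, then make them absolute.
pattern = re.compile(r"\.\./img/gifts/img[0-9]+\.jpg")
image_urls = [urljoin(base_url, img["src"]) for img in table.find_all("img", {"src": pattern})]
print(image_urls)
```

The logo image is skipped because its src does not match the pattern.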

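2. Scraping Google
1) Fetching the HTML
# download() is not defined in this post; it is assumed to return a requests response.
url = "https://www.google.com/search"
param = {"q":"박보영"}  # sent as the query string, so the URL itself needs no ?q=...
html = download(url, param)
google = BeautifulSoup(html.text, 'lxml')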
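A quick way to see what the q parameter looks like once it is percent-encoded into a query string. This sketch uses urllib.parse.urlencode from the standard library and makes no network request; the variable names are only for illustration.

```python
from urllib.parse import urlencode

base_url = "https://www.google.com/search"
param = {"q": "박보영"}

# urlencode percent-encodes the value as UTF-8 bytes.
query_string = urlencode(param)
full_url = base_url + "?" + query_string
print(full_url)
```

So a dict of parameters and a hand-encoded URL carry the same information, and only one of them is needed.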

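2) Collecting the results as a list
result = list()
# Each title <h3> sits inside the result's <a>, so its parent holds the href.
# Google's class names change over time; "LC20lb" has a lowercase L, not the digit 1.
for _ in google.find_all("h3", {"class":"LC20lb"}):
    result.append((_.text.strip(), _.find_parent()['href']))
print(result[0][0])

3) Collecting the results as a dictionary
result = dict()
for _ in google.find_all("h3", {"class":"LC20lb"}):
    result[_.text.strip()] = _.find_parent()['href']
# Iterating a dict directly yields only keys; .items() yields (key, value) pairs.
for k, v in result.items():
    print(k)
    print(v)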
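The list and dictionary versions can be compared on a small offline sample. This is a sketch only: the HTML and the class name result-title are hypothetical stand-ins for a real results page, and find_parent("a") names the parent tag explicitly.

```python
from bs4 import BeautifulSoup

# Made-up fragment with the assumed shape: <a href="..."><h3 class="...">title</h3></a>
sample = (
    '<div><a href="https://example.com/one"><h3 class="result-title"> First </h3></a></div>'
    '<div><a href="https://example.com/two"><h3 class="result-title">Second</h3></a></div>'
)
soup = BeautifulSoup(sample, "html.parser")

as_list = []
as_dict = {}
for h3 in soup.find_all("h3", {"class": "result-title"}):
    title = h3.text.strip()
    href = h3.find_parent("a")["href"]   # the enclosing link holds the URL
    as_list.append((title, href))
    as_dict[title] = href

# .items() yields (key, value) pairs, which is what "for k, v in ..." needs.
for title, href in as_dict.items():
    print(title, href)
```

One trade-off to keep in mind: the dictionary keeps a single URL per title, so two results with the same title would overwrite each other, while the list keeps both.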

3. How the Naver search results were found
Find every dt, then use find_parents(limit=4) to check that its ancestors are dl -> li -> ul -> div, in that order.
naver = BeautifulSoup(html.text, 'lxml')
for _ in naver.find_all("dt"):
    # Join the names of the four nearest ancestors, closest first.
    if "-".join([_.name for _ in _.find_parents(limit=4)]) == "dl-li-ul-div":
        a = _.find("a")
        if a:
            print(a.text.strip())
            print(a['href'])
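The ancestor check can be seen on a small offline sample. A sketch with made-up HTML: one dt has the dl-li-ul-div ancestry and one does not.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the first <dt> has the wanted ancestry, the second does not.
sample = (
    '<div><ul><li><dl><dt><a href="https://example.com/hit">Hit</a></dt></dl></li></ul></div>'
    '<section><dl><dt><a href="https://example.com/miss">Miss</a></dt></dl></section>'
)
soup = BeautifulSoup(sample, "html.parser")

matches = []
for dt in soup.find_all("dt"):
    # find_parents returns ancestors from nearest to farthest; limit=4 stops after four.
    chain = "-".join(parent.name for parent in dt.find_parents(limit=4))
    if chain == "dl-li-ul-div":
        a = dt.find("a")
        if a:
            matches.append((a.text.strip(), a["href"]))

print(matches)
```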

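4. How to find the Daum search results
# dom_daum is the BeautifulSoup object parsed from the Daum results page.
i = 0
for _ in dom_daum.find_all(class_=["wrap_tit"]):
    a = _.find("a")
    if a:
        i += 1
        print(a.text.strip())
        print(a['href'])

# Number of results that contained a link.
print(i)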

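5. Fetching real-time trending search terms from Naver

1) download (shown with the Daum search from section 4)

html = download("https://search.daum.net/search", {"q": "박보영"})
daum = BeautifulSoup(html.text, "lxml")
# The CSS selector picks <a> tags that are direct children of div.wrap_tit.
[_["href"] for _ in daum.select("div.wrap_tit > a")]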

2) Getting objects with a selector

import requests
from bs4 import BeautifulSoup

# htmlNaver is assumed to hold the HTML text of the Naver page, downloaded beforehand.
soup = BeautifulSoup(htmlNaver, 'lxml')
result = soup.select(".ah_k")
for _ in result:
    print(_.text)
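Both selectors used in this section can be tried on an offline sample. This is a sketch with made-up HTML; only the class names ah_k and wrap_tit come from the notes above.

```python
from bs4 import BeautifulSoup

# Made-up fragment reusing the class names from the notes.
sample = (
    '<ul><li><span class="ah_k">first term</span></li>'
    '<li><span class="ah_k">second term</span></li></ul>'
    '<div class="wrap_tit"><a href="https://example.com/a">direct child</a>'
    '<p><a href="https://example.com/b">nested deeper</a></p></div>'
)
soup = BeautifulSoup(sample, "html.parser")

keywords = [tag.text for tag in soup.select(".ah_k")]              # any tag with class ah_k
direct_links = [a["href"] for a in soup.select("div.wrap_tit > a")]  # direct children only
all_links = [a["href"] for a in soup.select("div.wrap_tit a")]       # any descendant

print(keywords)
print(direct_links)
print(all_links)
```

The > combinator matches direct children only, while a plain space matches descendants at any depth.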

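6. Summary
1) The for _ in ~ loop
2) The [... for _ in ~] list comprehension
3) The "-".join([_.name for _ in _.find_parents(limit=4)]) expression
- The result looks like "dl-li-ul-div": the string in front of .join is placed between the items.
- recursive=False is an argument of find_all() and find() that limits the search to direct children; find_parents() does not support it. To move up exactly one level, use find_parent() or .parent.
- find_parents(limit=4) walks up at most four levels.
4) find_all(class_=["wrap_tit", "BCD", "EFG"])
- Several class names can be passed at once; they are combined with OR, so an element matching any one of them is returned.
5) Why searching for class "abc" also returns elements with class="abc d e f"
- In HTML, class is a space-separated list of independent class names, not a parent class followed by child classes. An element with class="abc d e f" has four classes, and BeautifulSoup matches it when any one of them equals "abc".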
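Points 3) to 5) can be checked on a small offline sample. A sketch with made-up HTML; the class names are taken from the notes above plus a hypothetical xyz that should never match.

```python
from bs4 import BeautifulSoup

# Made-up fragment for checking class matching.
sample = (
    '<p class="abc d e f">one</p>'
    '<p class="abc">two</p>'
    '<p class="BCD">three</p>'
    '<p class="xyz">four</p>'
)
soup = BeautifulSoup(sample, "html.parser")

# A single class name matches every element that has it among its classes.
by_single = [p.text for p in soup.find_all("p", {"class": "abc"})]

# A list of class names is an OR: any one of them is enough.
by_any = [p.text for p in soup.find_all(class_=["abc", "BCD"])]

# The class attribute is stored as a list of separate names.
classes_of_first = soup.find("p")["class"]

# str.join puts the separator between the items.
joined = "-".join(["dl", "li", "ul", "div"])

print(by_single, by_any, classes_of_first, joined)
```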
