Written by java-style
on 2021-12-10

Using Python 10-2: Web Scraping (Overview and Package Usage) - Practice

01 find/find_all practice

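    from bs4 import BeautifulSoup as bs

    # The tags inside this string were stripped from the post; the markup here
    # is reconstructed from the find() calls and the printed text below.
    html = """
    <html>
    <head><title>Test BeautifulSoup</title></head>
    <body>
    <p align="center">text contents</p>
    <p align="right" class="myp">text contents 2</p>
    <p align="left" a="b">text contents 3</p>
    </body>
    </html>
    """

    # Use a separate name so the imported alias bs is not shadowed
    soup = bs(html, 'html.parser')

    # Match by keyword arguments (class_ because class is a Python keyword)
    print(soup.find('p', align="center"))
    print(soup.find('p', class_="myp"))
    print(soup.find('p', a="b"))
    print("--------------------------------")
    # Match by an attrs dictionary
    print(soup.find('p', attrs={"align":"center"}))
    print(soup.find('p', attrs={"align":"right", "class":"myp"}))
    print(soup.find('p', attrs={"align":"left"}))

Output:

    <p align="center">text contents</p>
    <p align="right" class="myp">text contents 2</p>
    <p align="left" a="b">text contents 3</p>
    --------------------------------
    <p align="center">text contents</p>
    <p align="right" class="myp">text contents 2</p>
    <p align="left" a="b">text contents 3</p>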

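    from bs4 import BeautifulSoup as bs

    # The tags inside this string were stripped from the post; the <a> elements
    # are reconstructed from the link texts and URLs in the output below.
    html = """
    <a href="http://naver.com">naver</a>
    <a href="http://daum.net">daum</a>
    <a href="http://nate.com">nate</a>
    <a href="http://google.com">google</a>
    <a href="http://yahoo.com">yahoo</a>
    """

    data = bs(html, 'html.parser')
    links = data.find_all("a")
    print(links)
    print(type(links))

    # Read the href attribute and the text of every <a> tag
    for i in links:
        href = i.attrs['href']
        text = i.string
        print(text, " : ", href)

Output:

    [<a href="http://naver.com">naver</a>, <a href="http://daum.net">daum</a>, <a href="http://nate.com">nate</a>, <a href="http://google.com">google</a>, <a href="http://yahoo.com">yahoo</a>]
    <class 'bs4.element.ResultSet'>
    naver  :  http://naver.com
    daum  :  http://daum.net
    nate  :  http://nate.com
    google  :  http://google.com
    yahoo  :  http://yahoo.com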

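Remember this pattern: find_all() returns every matching tag, and a for loop over the result lets you read the attributes of each tag in turn.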
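As a minimal sketch of that loop, the snippet below uses a made-up HTML fragment (the third link without an href is an assumption added for illustration). It reads the attribute with Tag.get(), which returns None for a missing attribute, whereas indexing with ['href'] would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the last <a> deliberately has no href.
html = """
<a href="http://naver.com">naver</a>
<a href="http://daum.net">daum</a>
<a>no link</a>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for a in soup.find_all("a"):
    # a.get("href") returns None when the attribute is missing;
    # a["href"] would raise KeyError for the third tag.
    pairs.append((a.string, a.get("href")))

for text, href in pairs:
    print(text, ":", href)
```

Tag.get() is the safer choice when scraping pages where not every tag is guaranteed to carry the attribute.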

Scraping the Korea Meteorological Administration (KMA) API

    import urllib.request as req
    from bs4 import BeautifulSoup as bs

    url = "http://www.kma.go.kr/weather/forecast/mid-term-rss3.jsp"
    res = req.urlopen(url)
    soup = bs(res, 'html.parser')

    # The feed title and the first <wf> (weather forecast) element
    title = soup.find("title").string
    wf = soup.find("wf")
    print(title)
    print(wf)

Output (Korean text returned by the feed):

    기상청 육상 중기예보
    ○ (강수) 제주도에는 13일 오전에 비 또는 눈, 16일 오후에 비가 오겠습니다.
    ○ (기온) 이번 예보기 간 아침 기온은 -8~6도, 낮 기온은 1~14도로 어제(9일, 아침최저기온 -1~9도, 낮최고기온 4~17도)보다 낮겠습니다.
    ○ (해상) 13일(월)은 대부분 해상에서 물결이 1.0~4.0m로 매우 높게 일겠습니다.

02 select practice

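    from bs4 import BeautifulSoup as bs

    # The tags inside this string were stripped from the post; the markup is
    # reconstructed from the selectors and the printed text below. The placement
    # of the div1/div2 elements is a guess.
    html = """
    <html>
    <head><title>Test</title></head>
    <body>
    <div>div1</div>
    <div>div2</div>
    <div id="main">
      <h1>Book list</h1>
      <ul class="items">
        <li>Introduction to Java Programming</li>
        <li>HTML5</li>
        <li>Python</li>
      </ul>
    </div>
    </body>
    </html>
    """

    soup = bs(html, 'html.parser')

    # select_one() returns the first match as a single tag
    h1 = soup.select_one("div#main > h1").string
    print("h1=", h1)

    # select() returns every match as a list
    li_list = soup.select("div#main > ul.items > li")
    for li in li_list:
        print("li=", li.string)

Output:

    h1= Book list
    li= Introduction to Java Programming
    li= HTML5
    li= Python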

* In a select() selector, '#' matches a specific id and '.' matches a specific class.
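A small sketch of the two prefixes, using a made-up fragment (the id main and the class note are hypothetical names chosen for this example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one element with an id, two elements sharing a class.
html = """
<div id="main">
  <p class="note">first</p>
  <p class="note">second</p>
  <p>third</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#main")     # '#' selects by id
by_class = soup.select(".note")  # '.' selects by class

note_texts = [p.text for p in by_class]
print(len(by_id), note_texts)
```

Both calls return a list even when only one element matches, which is why the id lookup still has to be indexed or looped over.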

Scraping a website with select()

    import requests
    from bs4 import BeautifulSoup as bs

    req = requests.get("http://unico2013.dothome.co.kr/crawling/exercise_css.html")
    html = req.content
    html = html.decode('utf-8')
    soup = bs(html, 'html.parser')
    print(html)

    title = soup.select("h1")             # by tag name
    title1 = soup.select("#f_subtitle")   # by id
    title2 = soup.select(".subtitle")     # by class
    title3 = soup.select("aside > h2")    # <h2> children of <aside>
    img = soup.select('[src]')            # any tag that has a src attribute

    print(title)
    print(type(title))
    print(type(title[0]))
    print("Number of <h1> tags: %d" % len(title))
    print("Number of tags with the id f_subtitle: %d" % len(title1))
    print("Number of tags with the class subtitle: %d" % len(title2))
    print("Number of <h2> children of the <aside> tag: %d" % len(title3))
    print("Number of tags with a src attribute: %d" % len(img))

Output (the page source printed by print(html) is not shown):

    [<h1>CSS 선택자 학습</h1>]
    <class 'bs4.element.ResultSet'>
    <class 'bs4.element.Tag'>
    Number of <h1> tags: 1
    Number of tags with the id f_subtitle: 1
    Number of tags with the class subtitle: 2
    Number of <h2> children of the <aside> tag: 1
    Number of tags with a src attribute: 1

    for content in title:
        print(content.string)
    print()
    for content in title1:
        print(content.text)
    print()
    for content in title2:
        print(content.text)
    print()
    for content in title3:
        print(content.text)
    print()
    for content in img:
        print(content["src"])

Output:

    CSS 선택자 학습

    교육과정 소개

    웹 클라이언트 기술
    학습 순서(수집)

    학습 순서(수집)

    https://www.python.org/static/img/python-logo.png

Extracting attributes from a site

    import requests
    from bs4 import BeautifulSoup

    req = requests.get("http://unico2013.dothome.co.kr/crawling/exercise_bs.html")
    soup = BeautifulSoup(req.content, "html.parser")

    print("[content of the <h1> tag]", soup.select("h1")[0].text)
    print('[content + href value of <a> tags that contain text]')
    print()

    aTag = soup.select('a')
    for tag in aTag:
        # Skip <a> tags that have no text (for example, links that wrap an image)
        if(tag.text.strip()):
            print(tag.text, ':', tag['href'])

Output:

    [content of the <h1> tag] HTML의 링크 태그
    [content + href value of <a> tags that contain text]

    World Wide Consortium : http://www.w3.org/
    Java Page : http://java.sun.com/
    Python Page : http://www.python.org/
    Web Client 기술 학습 : http://www.w3schools.com/

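The [0] in soup.select("h1")[0].text is needed because select() returns a ResultSet object, which behaves like a list. Reading .text on the ResultSet itself raises an AttributeError whose message says you are probably treating a list of elements like a single element, so you have to index into the result first.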
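The following sketch reproduces that error on a one-line document (the markup is made up for illustration) and shows two ways around it: indexing the result, or using select_one(), which returns a single tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Title</h1>", "html.parser")
result = soup.select("h1")  # list-like ResultSet, even for a single match

try:
    result.text  # the ResultSet itself has no .text
    raised = False
except AttributeError as err:
    raised = True
    print("AttributeError:", err)

first_text = result[0].text             # index into the list first
one_text = soup.select_one("h1").text   # or ask for a single tag directly
print(first_text, one_text)
```

Note that select_one() returns None when nothing matches, so it needs its own check on pages where the tag may be absent.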

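if(tag.text.strip()) reads as "if this tag has any text left after surrounding whitespace is removed": strip() returns an empty string for a tag with no visible text, and an empty string is false in a condition.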
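A minimal sketch of that filter, assuming a made-up fragment with one image-only link, one text link, and one whitespace-only link:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the second <a> contains visible text.
html = (
    '<a href="/img"><img src="logo.png"></a>'
    '<a href="/home"> Home </a>'
    '<a href="/blank">   </a>'
)

soup = BeautifulSoup(html, "html.parser")

# "" is falsy, so tags whose text is empty or only whitespace are skipped.
kept = [a["href"] for a in soup.select("a") if a.text.strip()]
print(kept)
```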

    print('[src attribute of the <img> tag]', soup.select("img")[0]['src'])
    # Same result as the next line:
    # print('[content of the first <h2> tag]', soup.select("h2")[0].text)
    print('[content of the first <h2> tag]', soup.select('h2:nth-of-type(1)')[0].text)
    print('[content of the <li> child of <ul> whose style attribute ends with green]', soup.select("ul > li[style$=green]")[0].text)
    print('[content of the second <h2> tag]', soup.select('h2:nth-of-type(2)')[0].text)
    print('contents of all child tags of the <ol> tag')
    olTags = soup.select('ol')
    for tag in olTags:
        print(tag.text)

Output:

    [src attribute of the <img> tag] http://unico2013.dothome.co.kr/image/duke.jpg
    [content of the first <h2> tag] 좋아하는 색
    [content of the <li> child of <ul> whose style attribute ends with green] 녹색
    [content of the second <h2> tag] 먹고싶은 음식
    contents of all child tags of the <ol> tag
    짜장면 냉면 돈까스 갈비

    print('[contents of all descendant tags of the <table> tag]', soup.select('table')[0].text.strip())
    print('content of the <tr> tag with the class name', soup.select('tr.name')[0].text)
    print('content of the <td> tag with the id target', soup.select('td#target')[0].text)

Output:

    [contents of all descendant tags of the <table> tag] 둘리또치도우너 케라토사우루스타조외계인 도봉구 쌍문동아프리카깐따삐아 별
    content of the <tr> tag with the class name 둘리또치도우너
    content of the <td> tag with the id target 아프리카

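Remember that select() returns a list, so you have to index into it (or loop over it) to reach an individual tag. To read a specific attribute such as src from a tag: select("img")[0]['src']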

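* Selecting the img tag returns a list, so index into the list first and then read the attribute from that element with ['...'].
* The n-th tag among tags of the same type: __:nth-of-type(n)
* An attribute value that ends with a given string: [attr$=____]
* To get the contents of every matching tag, loop over the result with a for loop.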
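The notes above can be tried offline with a small sketch. The markup below is made up for illustration (the headings, list items, and file name are assumptions, not the contents of the exercise page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the selectors used above.
html = """
<div>
  <h2>Colors</h2>
  <ul>
    <li style="color:red">red</li>
    <li style="color:green">green</li>
  </ul>
  <h2>Foods</h2>
  <img src="duke.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Index into the list, then read the attribute from that tag.
src = soup.select("img")[0]["src"]

# Second <h2> among its sibling <h2> tags.
second_h2 = soup.select("h2:nth-of-type(2)")[0].text

# <li> whose style attribute value ends with "green".
green = soup.select("ul > li[style$=green]")[0].text

# Loop to collect the contents of every matching tag.
items = [li.text for li in soup.select("ul > li")]

print(src, second_h2, green, items)
```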

from http://treasure0326.tistory.com/145 by ccl(A) rewrite - 2021-12-10 12:01:23
