Full-page web screenshots

2020-04-13 2024-12-31

tech

5 minutes read (About 734 words)

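Even a hundred thousand words of fine writing cannot stop site moderators from deleting posts and banning accounts. This post explains how to preserve web page content by technical means.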

python, spider

python & bs4 basics

2016-09-19 2024-12-31

tech

7 minutes read (About 1000 words)

python & bs4

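Scraping web pages with regular expressions is tedious, and regular expressions are hard to master. With bs4, select (CSS selectors) and find (tag and attribute lookups) are called on the soup object and return matching tags, which makes it easy to extract the content of HTML or XML tags.

Example: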

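import urllib.request  # Python 3 module that replaces Python 2's urllib2
from bs4 import BeautifulSoup

# target_url, _headers and self.encoding come from the surrounding class (not shown)
req = urllib.request.Request(target_url, headers=_headers)
myPage = urllib.request.urlopen(req).read().decode(self.encoding)
soup = BeautifulSoup(myPage, 'lxml')

# Select <a> tags that are direct children of the nested divs (class substring match)
dom_tag_a = soup.select('div[class*="right_wrap"] > div[class*="content"] > div[class*="phref"] > a')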
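
To try the same selector without a network request, here is a minimal sketch that runs it against an inline HTML string. The sample markup, the helper names `extract_links` and `first_link_with_find`, and the use of the built-in `html.parser` (instead of `lxml`) are assumptions made for illustration; they are not taken from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch; it mirrors the nesting
# that the selector in the example above expects.
SAMPLE_HTML = """
<div class="page right_wrap">
  <div class="content main">
    <div class="phref"><a href="/post/1">First post</a></div>
    <div class="phref"><a href="/post/2">Second post</a></div>
    <div class="other"><a href="/about">About</a></div>
  </div>
</div>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select(
        'div[class*="right_wrap"] > div[class*="content"] > div[class*="phref"] > a'
    )
    return [(a.get_text(strip=True), a["href"]) for a in anchors]


def first_link_with_find(html):
    """Return the href of the first link inside a div with class "phref", using find."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="phref")
    link = div.find("a") if div is not None else None
    return link["href"] if link is not None else None


links = extract_links(SAMPLE_HTML)
print(links)
print(first_link_with_find(SAMPLE_HTML))
```

`select` returns a list of every matching tag, while `find` returns only the first match (or `None`), so `select` tends to suit bulk extraction and `find` suits single lookups.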

python, spider
