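When you build a new website, it starts out with no content, so you often need to scrape content from other people's pages. The usual steps are:
Download the page by URL, then, based on the HTML structure of each page, parse the text with regular expressions or some other method and extract the body text you want.
Writing structure-specific parsing for every single page still takes too much development time, so here is my approach.
Most people know Python's BeautifulSoup package: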
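from bs4 import BeautifulSoup  # BeautifulSoup 4; the old BeautifulSoup 3 form was "import BeautifulSoup"
soup = BeautifulSoup(html, "html.parser")  # html holds the downloaded page source
First, use the package to remove the script and style elements from the HTML:
for tag in soup.find_all(["script", "style"]):
    tag.extract()  # detach the element and its contents from the tree
Once that is done, the package's prettify() function normalizes the markup layout, putting each tag and each piece of text on its own line:
soup.prettify()
Then use a regular expression to strip out every HTML tag:
import re
reg1 = re.compile("<[^>]*>")
content = reg1.sub('', soup.prettify())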
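One detail the tag-stripping step leaves open is that prettify() writes characters such as & and < as entities (&amp;, &lt;), and these survive the regular expression. The following is a minimal sketch, using only the standard library, of stripping tags with the same pattern as reg1 and then decoding the entities. The helper name strip_tags and the sample string are made up for illustration and are not part of the original post.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")  # same pattern as reg1 in the post

def strip_tags(markup):
    """Remove anything that looks like a tag, then decode entities such as &amp;."""
    # Decode only after the tags are gone, so a decoded "<" is never mistaken for a tag.
    return html.unescape(TAG_RE.sub("", markup))

# Hypothetical sample, laid out roughly the way prettify() would print it.
sample = "<p>\n Fish &amp; chips\n</p>\n<span>\n 3 &lt; 5\n</span>"
print(strip_tags(sample))
```

The pattern is deliberately simple: it would also be confused by a literal > inside an attribute value, so treat it as a rough cleanup rather than a full HTML parser.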
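What remains is plain text, usually one line after another. After dropping the blank lines, you know how many lines there are in total and how many characters each line contains. I used Excel to chart the per-line character counts, as shown in the figure below:
The x-axis is the line number; the y-axis is the number of characters on that line.
There is clearly a peak: lines 81~91 should be the body of this page, so I only need to extract the text on lines 81~91.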
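The per-line character counts can also be computed directly in Python instead of a spreadsheet. This is a small sketch under two assumptions of mine: lines are numbered after the blank ones are dropped, and leading indentation (which prettify() adds) is not counted. The function name line_lengths and the sample text are illustrative only.

```python
def line_lengths(text):
    """Return (line_number, character_count) pairs for the non-blank lines of text."""
    # Strip indentation and drop lines that are empty or whitespace-only.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Number the surviving lines from 1, matching the x-axis of the chart.
    return [(number, len(line)) for number, line in enumerate(lines, start=1)]

# Hypothetical tag-free text: a menu entry, one long body line, a footer entry.
sample = "\nHome\n\nA long sentence standing in for the article body.\n\nFooter\n"
for number, count in line_lengths(sample):
    print(number, count)
```

The resulting pairs are the same data as the chart: line number against character count.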
So here is the question: following this approach, is there a good algorithm that can use data analysis to work out which lines the peak of long text falls on?
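One possible heuristic, offered only as a sketch and not as an established answer: give each line a score of (character count minus a threshold), so long lines score positive and short ones negative, then take the contiguous run of lines with the largest total score (the classic maximum-subarray scan). The function name densest_line_range, the threshold of 40, and the sample counts below are all assumptions chosen for illustration.

```python
def densest_line_range(lengths, threshold):
    """Return (first, last), 1-based and inclusive, for the contiguous run of
    lines whose summed (length - threshold) score is largest.
    Return None when no line is longer than the threshold."""
    best_sum, best_range = 0, None
    run_sum, run_start = 0, 0
    for i, length in enumerate(lengths):
        if run_sum <= 0:
            # The current run no longer helps; start a fresh run at this line.
            run_sum, run_start = 0, i
        run_sum += length - threshold
        if run_sum > best_sum:
            best_sum, best_range = run_sum, (run_start + 1, i + 1)
    return best_range

# Hypothetical per-line character counts: short navigation lines around a long body.
counts = [4, 6, 3, 120, 95, 8, 140, 110, 5, 7]
print(densest_line_range(counts, threshold=40))
```

Because a short line only subtracts a little from the running total, an occasional short line inside the body (a subheading, for example) does not split the detected range. The threshold still has to be tuned per site; the mean or median line length is one plausible starting point.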
BeautifulSoup can do more than search, navigate, and modify a document; it can also print the document in a readable format. BeautifulSoup supports two kinds of output:
Formatted output
Unformatted output
Formatted output
BeautifulSoup has a built-in method, prettify(), that produces formatted output. For example:
from bs4 import BeautifulSoup
html_markup = """<p class="ecopyramid">
<ul id="producers">
<li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""
soup = BeautifulSoup(html_markup, "lxml")
print(soup.prettify())