爬蟲教學請參考下面影片
其他說明
- 大部分的網站都會鎖爬蟲程式,所以我用fake-useragent套件,建立了虛擬的User Agent共十組,來模仿這是10個不同裝置登入的樣子,避免同一台裝置因短時間內大量下載資料而被鎖住。但其實此練習僅抓72筆資料,算很小的下載量,此步驟可略過。感謝Goodinfo!提供完整且免費的資料啊~
- 另外則是連線方式,因為Goodinfo!採HTTP POST,因此用request連線需要作一下處理。更多關於HTTP連線方式可參考 淺談 HTTP Method:表單中的 GET 與 POST 有什麼差別?。我只是業餘的,靠google大法來處理每次出現的Bug,還是得感謝各位無私分享知識的高手們。
- 關於Python怎們跟MySQL溝通,我是使用MySQL個官方DB-API叫做mysql-connector-python,如何安裝跟簡易的操作請參考我的這篇。
下面的code可存成.py檔,寫一個shell丟桌面。下班開電腦,點兩下就可以執行。
In [ ]:
#建立股票清單(台股半導體代號)
num= [2303, 2329, 2330, 2337, 2338, 2342, 2344, 2351, 2363, 2369, 2379,2388, 2401, 2408,
2434, 2436, 2441, 2449, 2451, 2454, 2458, 2481,3006, 3014, 3016, 3034, 3035, 3041,
3054, 3094, 3189, 3257, 3413,3443, 3530, 3532, 3536, 3545, 3583, 3588, 3661, 3686,
3711, 4919,4952, 4961, 4967, 4968, 5269, 5285, 5305, 5471, 6202, 6239, 6243, 6257,
6271, 6415, 6451, 6525, 6531, 6533, 6552, 6573, 8016, 8028,8081, 8110, 8131, 8150,
8261, 8271]
#載入必要模組
import requests #傳送request
import bs4 as bs #beautifulsoup4 解HTML用
import pandas as pd
import numpy as np
#建立空表格
df = pd.DataFrame({"Stock":num,"PBR":np.zeros(len(num))})
df.set_index("Stock", inplace= True)
##建立虛擬的Header User agent清單,防止IP被鎖。可用fake-useragent套件創建隨機user agent
ua=['Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1623.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; de-DE) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36']
#開始下載資料
try:
for n in range(0,len(num)):
PBR_Url = "https://goodinfo.tw/StockInfo/StockDetail.asp?STOCK_ID=%i" % num[n] #連線goodinfo
header=ua[np.random.randint(0,10)] #亂數抽一個user agent
headerForReq={"User-Agent":header} #r建立dict: request header
response_goodinfo = requests.post(PBR_Url, headers=headerForReq) #
result = bs.BeautifulSoup(response_goodinfo.content, "html.parser")#提供 request header
data = result.find_all("td") #從抓取的HTML檔中,找出有<td>標籤, 建立名為data的list
if data[102].string != "N/A": #arg 102 或103 (此公司沒有PER資料時會多一個<td>標籤) 是PBR在data中的位置
PBR = float(data[102].string)
df.loc[num[n]]=PBR
print('Stock %i' % num[n], 'PBR :', PBR, ",第",n+1,"/",len(num),"筆")
elif data[102].string == "N/A":
PBR = float(data[103].string)
df.loc[num[n]]=PBR
print('Stock %i' % num[n], 'PBR :', PBR, ",第",n+1,"/",len(num),"筆")
else:
print('Stock %i' % num[n],"got ERROR")
except IndexError:
print("檢查",header, ", 這個可能不能用。")
print('或者IP可能又被鎖')
#連線資料庫
import mysql.connector
HugoDB = mysql.connector.connect(
host = "127.0.0.1", # Input the database server IP
user = "root", #Input the username
password = "password", #input the username
database = "MyDatabase", #input the name of the database
)
cursor=HugoDB.cursor()
#建立當日表格
import datetime as dt
today = str(dt.date.today().strftime("%Y_%m_%d"))#生成今日日期標籤 output_ex: 2099_12_31
CreateTableCmd="CREATE TABLE IF NOT EXISTS %s_TW_SEMICOND_PBR(Date DATE,Stock INT NOT NULL DEFAULT '0',PBR FLOAT(7,3) NOT NULL DEFAULT '0')ENGINE=MyISAM DEFAULT CHARSET=big5;"%today #mysql建立新表格的指令
DropTableCmd="DROP TABLE IF EXISTS %s_TW_SEMICOND_PBR;"%today #刪除重複mysql建立新表格的指令
TableToday="%s_TW_SEMICOND_PBR"%today #今日表格 output_ex: 2099_12_31_TW_SEMICOND_PBR
cursor.execute(DropTableCmd)
cursor.execute(CreateTableCmd)
#建立資料清單
inputdata = []
for i in np.arange(0,len(df.index)):
stock = df.index[i]
inputdata.append((str(stock),str(float(df.loc[stock])))); #全部轉成string格式, 防止放入cursor.excutemany()時出錯
#將當日資料存入MySQL
sqlcmd="INSERT INTO "+TableToday +" (Date,Stock,PBR) VALUES (DATE(NOW()),%s, %s);" #mysql command
cursor.executemany(sqlcmd,inputdata)
HugoDB.commit()
print("已存入本日PBR")
抓取結果¶
mysql> SELECT * FROM 2020_02_22_TW_SEMICOND_PBR;
+------------+-------+--------+
| Date | Stock | PBR |
+------------+-------+--------+
| 2020-02-22 | 2303 | 0.920 |
| 2020-02-22 | 2329 | 1.200 |
| 2020-02-22 | 2330 | 5.310 |
| 2020-02-22 | 2337 | 2.080 |
| 2020-02-22 | 2338 | 2.260 |
| 2020-02-22 | 2342 | 1.810 |
| 2020-02-22 | 2344 | 1.130 |
| 2020-02-22 | 2351 | 2.010 |
| 2020-02-22 | 2363 | 0.620 |
| 2020-02-22 | 2369 | 0.740 |
| 2020-02-22 | 2379 | 4.680 |
| 2020-02-22 | 2388 | 4.210 |
| 2020-02-22 | 2401 | 0.850 |
| 2020-02-22 | 2408 | 1.700 |
| 2020-02-22 | 2434 | 2.180 |
| 2020-02-22 | 2436 | 1.420 |
| 2020-02-22 | 2441 | 1.700 |
| 2020-02-22 | 2449 | 1.670 |
| 2020-02-22 | 2451 | 1.760 |
| 2020-02-22 | 2454 | 1.970 |
| 2020-02-22 | 2458 | 3.820 |
| 2020-02-22 | 2481 | 1.340 |
| 2020-02-22 | 3006 | 1.350 |
| 2020-02-22 | 3014 | 1.730 |
| 2020-02-22 | 3016 | 3.810 |
| 2020-02-22 | 3034 | 4.000 |
| 2020-02-22 | 3035 | 2.370 |
| 2020-02-22 | 3041 | 0.990 |
| 2020-02-22 | 3054 | 2.090 |
| 2020-02-22 | 3094 | 1.380 |
| 2020-02-22 | 3189 | 0.970 |
| 2020-02-22 | 3257 | 2.580 |
| 2020-02-22 | 3413 | 3.880 |
| 2020-02-22 | 3443 | 7.740 |
| 2020-02-22 | 3530 | 4.080 |
| 2020-02-22 | 3532 | 2.820 |
| 2020-02-22 | 3536 | 1.010 |
| 2020-02-22 | 3545 | 1.160 |
| 2020-02-22 | 3583 | 1.760 |
| 2020-02-22 | 3588 | 1.030 |
| 2020-02-22 | 3661 | 4.100 |
| 2020-02-22 | 3686 | 0.460 |
| 2020-02-22 | 3711 | 1.590 |
| 2020-02-22 | 4919 | 2.380 |
| 2020-02-22 | 4952 | 1.660 |
| 2020-02-22 | 4961 | 1.490 |
| 2020-02-22 | 4967 | 2.420 |
| 2020-02-22 | 4968 | 7.850 |
| 2020-02-22 | 5269 | 23.130 |
| 2020-02-22 | 5285 | 2.070 |
| 2020-02-22 | 5305 | 1.700 |
| 2020-02-22 | 5471 | 1.970 |
| 2020-02-22 | 6202 | 4.030 |
| 2020-02-22 | 6239 | 2.140 |
| 2020-02-22 | 6243 | 1.340 |
| 2020-02-22 | 6257 | 1.380 |
| 2020-02-22 | 6271 | 2.630 |
| 2020-02-22 | 6415 | 7.000 |
| 2020-02-22 | 6451 | 2.550 |
| 2020-02-22 | 6525 | 2.240 |
| 2020-02-22 | 6531 | 4.360 |
| 2020-02-22 | 6533 | 4.270 |
| 2020-02-22 | 6552 | 3.350 |
| 2020-02-22 | 6573 | 1.050 |
| 2020-02-22 | 8016 | 3.690 |
| 2020-02-22 | 8028 | 3.960 |
| 2020-02-22 | 8081 | 2.480 |
| 2020-02-22 | 8110 | 0.500 |
| 2020-02-22 | 8131 | 1.630 |
| 2020-02-22 | 8150 | 1.170 |
| 2020-02-22 | 8261 | 1.560 |
| 2020-02-22 | 8271 | 1.630 |
+------------+-------+--------+
72 rows in set (0.00 sec)
沒有留言:
張貼留言