2020年2月21日 星期五

Python 爬蟲練習:自Goodinfo!抓取台灣半導體上市公司之當日PBR

TW_SEMICOND_PBR_Crawler

Python 爬蟲練習:自Goodinfo!抓取台灣半導體上市公司之當日PBR

步驟摘要:

  1. 建立股票代號清單
  2. 連線Goodinfo!並抓取HTML。(使用request套件)
  3. 解析抓下的HTML檔,從中找出PBR資料,並保留到空白清單中(使用BeautifulSoup4套件)
  4. 連線到MySQL
  5. 在MySQL建立本日的空白表格
  6. 將剛剛抓取的PBR資料存入MySQL中的本日表格
爬蟲概略流程

爬蟲教學請參考下面影片


其他說明

  • 大部分的網站都會鎖爬蟲程式,所以我用fake-useragent套件,建立了虛擬的User Agent共十組,來模仿這是10個不同裝置登入的樣子,避免同一台裝置因短時間內大量下載資料而被鎖住。但其實此練習僅抓72筆資料,算很小的下載量,此步驟可略過。感謝Goodinfo!提供完整且免費的資料啊~
  • 關於Python怎們跟MySQL溝通,我是使用MySQL個官方DB-API叫做mysql-connector-python,如何安裝跟簡易的操作請參考我的這篇

下面的code可存成.py檔,寫一個shell丟桌面。下班開電腦,點兩下就可以執行。

In [ ]:
#建立股票清單(台股半導體代號)
num= [2303, 2329, 2330, 2337, 2338, 2342, 2344, 2351, 2363, 2369, 2379,2388, 2401, 2408, 
      2434, 2436, 2441, 2449, 2451, 2454, 2458, 2481,3006, 3014, 3016, 3034, 3035, 3041, 
      3054, 3094, 3189, 3257, 3413,3443, 3530, 3532, 3536, 3545, 3583, 3588, 3661, 3686, 
      3711, 4919,4952, 4961, 4967, 4968, 5269, 5285, 5305, 5471, 6202, 6239, 6243, 6257, 
      6271, 6415, 6451, 6525, 6531, 6533, 6552, 6573, 8016, 8028,8081, 8110, 8131, 8150, 
      8261, 8271]
#載入必要模組
import requests #傳送request
import bs4 as bs #beautifulsoup4 解HTML用
import pandas as pd
import numpy as np


#建立空表格
df = pd.DataFrame({"Stock":num,"PBR":np.zeros(len(num))})
df.set_index("Stock", inplace= True)

##建立虛擬的Header User agent清單,防止IP被鎖。可用fake-useragent套件創建隨機user agent
ua=['Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1623.0 Safari/537.36',
 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36',
 'Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',
 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36',
 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
 'Mozilla/5.0 (Windows; U; Windows NT 6.1; de-DE) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4',
 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36']
#開始下載資料
try:
    for n in range(0,len(num)):
        PBR_Url = "https://goodinfo.tw/StockInfo/StockDetail.asp?STOCK_ID=%i" % num[n] #連線goodinfo
        header=ua[np.random.randint(0,10)]  #亂數抽一個user agent
        headerForReq={"User-Agent":header} #r建立dict: request header
        response_goodinfo = requests.post(PBR_Url, headers=headerForReq) #
        result = bs.BeautifulSoup(response_goodinfo.content, "html.parser")#提供 request header
        data = result.find_all("td")  #從抓取的HTML檔中,找出有<td>標籤, 建立名為data的list
        if data[102].string != "N/A": #arg 102 或103 (此公司沒有PER資料時會多一個<td>標籤) 是PBR在data中的位置  
            PBR = float(data[102].string)
            df.loc[num[n]]=PBR
            print('Stock %i' % num[n], 'PBR :', PBR, ",第",n+1,"/",len(num),"筆")
        elif data[102].string == "N/A":
            PBR = float(data[103].string)
            df.loc[num[n]]=PBR
            print('Stock %i' % num[n], 'PBR :', PBR, ",第",n+1,"/",len(num),"筆")
        else:  
            print('Stock %i' % num[n],"got ERROR")
except IndexError:
    print("檢查",header, ", 這個可能不能用。")
    print('或者IP可能又被鎖')

    
#連線資料庫        
import mysql.connector
HugoDB = mysql.connector.connect(
  host = "127.0.0.1",  # Input the database server IP
  user = "root", #Input the username
  password = "password", #input the username
  database = "MyDatabase", #input the name of the database
  )
cursor=HugoDB.cursor() 
#建立當日表格
import datetime as dt
today = str(dt.date.today().strftime("%Y_%m_%d"))#生成今日日期標籤 output_ex: 2099_12_31
CreateTableCmd="CREATE TABLE IF NOT EXISTS %s_TW_SEMICOND_PBR(Date DATE,Stock INT NOT NULL DEFAULT '0',PBR FLOAT(7,3) NOT NULL DEFAULT '0')ENGINE=MyISAM DEFAULT CHARSET=big5;"%today  #mysql建立新表格的指令
DropTableCmd="DROP TABLE IF EXISTS %s_TW_SEMICOND_PBR;"%today #刪除重複mysql建立新表格的指令
TableToday="%s_TW_SEMICOND_PBR"%today #今日表格 output_ex: 2099_12_31_TW_SEMICOND_PBR
cursor.execute(DropTableCmd)
cursor.execute(CreateTableCmd)

#建立資料清單
inputdata = []
for i in np.arange(0,len(df.index)):
    stock = df.index[i]
    inputdata.append((str(stock),str(float(df.loc[stock]))));  #全部轉成string格式, 防止放入cursor.excutemany()時出錯

#將當日資料存入MySQL    
sqlcmd="INSERT INTO "+TableToday +" (Date,Stock,PBR) VALUES (DATE(NOW()),%s, %s);"  #mysql command
cursor.executemany(sqlcmd,inputdata)
HugoDB.commit()    

print("已存入本日PBR")

抓取結果

mysql> SELECT * FROM 2020_02_22_TW_SEMICOND_PBR;
+------------+-------+--------+
| Date       | Stock | PBR    |
+------------+-------+--------+
| 2020-02-22 |  2303 |  0.920 |
| 2020-02-22 |  2329 |  1.200 |
| 2020-02-22 |  2330 |  5.310 |
| 2020-02-22 |  2337 |  2.080 |
| 2020-02-22 |  2338 |  2.260 |
| 2020-02-22 |  2342 |  1.810 |
| 2020-02-22 |  2344 |  1.130 |
| 2020-02-22 |  2351 |  2.010 |
| 2020-02-22 |  2363 |  0.620 |
| 2020-02-22 |  2369 |  0.740 |
| 2020-02-22 |  2379 |  4.680 |
| 2020-02-22 |  2388 |  4.210 |
| 2020-02-22 |  2401 |  0.850 |
| 2020-02-22 |  2408 |  1.700 |
| 2020-02-22 |  2434 |  2.180 |
| 2020-02-22 |  2436 |  1.420 |
| 2020-02-22 |  2441 |  1.700 |
| 2020-02-22 |  2449 |  1.670 |
| 2020-02-22 |  2451 |  1.760 |
| 2020-02-22 |  2454 |  1.970 |
| 2020-02-22 |  2458 |  3.820 |
| 2020-02-22 |  2481 |  1.340 |
| 2020-02-22 |  3006 |  1.350 |
| 2020-02-22 |  3014 |  1.730 |
| 2020-02-22 |  3016 |  3.810 |
| 2020-02-22 |  3034 |  4.000 |
| 2020-02-22 |  3035 |  2.370 |
| 2020-02-22 |  3041 |  0.990 |
| 2020-02-22 |  3054 |  2.090 |
| 2020-02-22 |  3094 |  1.380 |
| 2020-02-22 |  3189 |  0.970 |
| 2020-02-22 |  3257 |  2.580 |
| 2020-02-22 |  3413 |  3.880 |
| 2020-02-22 |  3443 |  7.740 |
| 2020-02-22 |  3530 |  4.080 |
| 2020-02-22 |  3532 |  2.820 |
| 2020-02-22 |  3536 |  1.010 |
| 2020-02-22 |  3545 |  1.160 |
| 2020-02-22 |  3583 |  1.760 |
| 2020-02-22 |  3588 |  1.030 |
| 2020-02-22 |  3661 |  4.100 |
| 2020-02-22 |  3686 |  0.460 |
| 2020-02-22 |  3711 |  1.590 |
| 2020-02-22 |  4919 |  2.380 |
| 2020-02-22 |  4952 |  1.660 |
| 2020-02-22 |  4961 |  1.490 |
| 2020-02-22 |  4967 |  2.420 |
| 2020-02-22 |  4968 |  7.850 |
| 2020-02-22 |  5269 | 23.130 |
| 2020-02-22 |  5285 |  2.070 |
| 2020-02-22 |  5305 |  1.700 |
| 2020-02-22 |  5471 |  1.970 |
| 2020-02-22 |  6202 |  4.030 |
| 2020-02-22 |  6239 |  2.140 |
| 2020-02-22 |  6243 |  1.340 |
| 2020-02-22 |  6257 |  1.380 |
| 2020-02-22 |  6271 |  2.630 |
| 2020-02-22 |  6415 |  7.000 |
| 2020-02-22 |  6451 |  2.550 |
| 2020-02-22 |  6525 |  2.240 |
| 2020-02-22 |  6531 |  4.360 |
| 2020-02-22 |  6533 |  4.270 |
| 2020-02-22 |  6552 |  3.350 |
| 2020-02-22 |  6573 |  1.050 |
| 2020-02-22 |  8016 |  3.690 |
| 2020-02-22 |  8028 |  3.960 |
| 2020-02-22 |  8081 |  2.480 |
| 2020-02-22 |  8110 |  0.500 |
| 2020-02-22 |  8131 |  1.630 |
| 2020-02-22 |  8150 |  1.170 |
| 2020-02-22 |  8261 |  1.560 |
| 2020-02-22 |  8271 |  1.630 |
+------------+-------+--------+
72 rows in set (0.00 sec)

沒有留言:

張貼留言