Monday, June 22, 2020

Web Crawling with Puppeteer on Node.js - Part 1/2

# Introduction


  • Puppeteer is a Node.js library that crawls the web by driving an automated Chromium browser (the open-source project behind Chrome).
  • Other approaches make you send raw HTTP requests, parse the responses yourself, and sometimes even manage cookies by hand. With Puppeteer the workflow is simply:
  • Launch the browser -> emulate a real user's actions on the web page -> read any data that a real user could reach
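
The launch -> emulate -> read flow can be sketched in a few lines once Puppeteer is installed. This is a minimal illustration, not a full crawler: `readPageTitle` is a hypothetical helper name, and `browser` is assumed to be the object returned by `await puppeteer.launch()`.

```javascript
// Hypothetical helper (not part of Puppeteer): opens a tab, visits `url`,
// and returns the page title. `browser` is assumed to come from
// `await puppeteer.launch()`.
async function readPageTitle(browser, url) {
  // 1. Open a new tab in the already launched browser.
  const page = await browser.newPage();
  try {
    // 2. Navigate as a user would and wait until the network is quiet.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // 3. Read something a real user could see on the page.
    return await page.title();
  } finally {
    // Close the tab even if navigation fails.
    await page.close();
  }
}
```

Passing the browser in, instead of launching it inside the helper, lets one browser instance be reused across many pages.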

Because the browser handles cookies, redirects, and JavaScript rendering for you, crawling with Puppeteer takes only a few lines of code.

This is part 1 of 2. In part 2, I'll demonstrate how to make it work on Android.

# Getting Started

Start by installing:

  • npm install puppeteer


Then save a simple script such as example_code.js, shown below. It reads data from behind a login by entering the account name and password and signing in automatically:

(async ()=>{
  const puppeteer = require('puppeteer');

  // Launch the Chromium browser that Puppeteer controls.
  // With headless: false, the browser window is shown.
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();

try {

  // Go to the login page's URL and sign in automatically.
  await page.goto("http://your.login.page/url", { waitUntil: "networkidle0" });

  // Start waiting for the navigation before the click happens, so a fast page
  // load is not missed (waitForNavigation times out after 30 seconds by default).
  let [data] = await Promise.all([
    page.evaluate(() => {

      // This code runs in the context of the page, the same context you get
      // when you right-click the page and choose Inspect -> Console,
      // so you can test it in your browser's Console first.
      document.getElementById("Account_Field_ID_Name").value = "Your Account Name";
      document.getElementById("Password_Field_ID_Name").value = "Your Password";
      document.getElementsByName("Login_Button_Name")[0].click();

      return "The data you want to return from the current page to your node.js app";

    }),
    page.waitForNavigation(),
  ]);
  
  console.log(data);

  // Read data from the page behind your logged-in account
  data = await page.evaluate(() => {

    return document.getElementById("Some_Text_Fields_ID").value;

  });
  
  console.log(data);

} catch (e) { console.error(e); }

  // Uncomment the next line once you have finished developing and want the browser to close at the end
  // await browser.close();

})();
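
As an alternative to setting field values inside `page.evaluate`, the same login can be driven with Puppeteer's `page.type` and `page.click`, which send keystrokes and mouse clicks much as a real user would. The sketch below is one possible way to package that, not the method used in the script above: `signInAndRead` is a hypothetical helper, and every selector and credential passed to it is a placeholder you would replace with values from your own target page. It expects a `page` obtained from `browser.newPage()`.

```javascript
// Hypothetical helper: signs in on `page` and returns one field's value.
// `page` is assumed to be a Puppeteer Page from `browser.newPage()`.
// All selectors and credentials are placeholders supplied by the caller.
async function signInAndRead(page, loginUrl, creds, selectors) {
  await page.goto(loginUrl, { waitUntil: 'networkidle0' });

  // Type into the fields the way a user would, key by key.
  await page.type(selectors.account, creds.account);
  await page.type(selectors.password, creds.password);

  // Begin waiting for the navigation before clicking, so a fast
  // redirect cannot finish before we start listening for it.
  await Promise.all([
    page.waitForNavigation(),
    page.click(selectors.submit),
  ]);

  // Read a value from the page behind the login.
  return page.$eval(selectors.result, (el) => el.value);
}

// Example call (placeholder values):
// const data = await signInAndRead(
//   page,
//   'http://your.login.page/url',
//   { account: 'Your Account Name', password: 'Your Password' },
//   { account: '#account', password: '#password', submit: '#login', result: '#result' }
// );
```

Wrapping the click and `waitForNavigation` in a single `Promise.all` is the usual way to avoid a race between the click and the page load it triggers.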
