A flexible Node.js crawler library: x-crawl

x-crawl is a flexible Node.js crawler library. It can crawl pages, control page interactions, batch network requests, batch download file resources, poll and recrawl on a schedule, and more. It supports both asynchronous and synchronous crawling. Running on Node.js, it is flexible and simple to use, and friendly to JS/TS developers.

If you find it useful, you can give the x-crawl repository a Star to support it; your Star is the motivation for continued updates.

Features

  • Supports both asynchronous and synchronous data crawling.
  • Flexible usage: multiple ways to write request configurations and obtain crawling results.
  • Flexible crawling intervals: no interval, a fixed interval, or a random interval, so you decide whether to allow or avoid highly concurrent crawling.
  • With simple configuration it can crawl pages, batch network requests, batch download file resources, poll and recrawl, and more.
  • Crawls SPAs (single-page applications) to obtain pre-rendered content (i.e. "SSR", server-side rendering), parses the content with the jsdom library, and also supports parsing it yourself.
  • Supports form submission, keyboard input, event actions, screenshots of generated pages, etc.
  • Captures and records the success or failure of each crawl and highlights the results.
  • Written in TypeScript: ships type definitions and provides generics.
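The random-interval option above can be pictured with a small sketch: given an intervalTime of { min, max } like the one passed to xCrawl() later in this article, each request is delayed by a random value in that range. This is a simplified illustration of the idea, not x-crawl's internal implementation.

```typescript
// Simplified illustration of a random crawl interval. This is NOT
// x-crawl's internal code, just a sketch of the behavior its
// intervalTime: { min, max } option describes.
interface IntervalTime {
  min: number
  max: number
}

// Pick a random delay (in ms) in [min, max).
function randomInterval({ min, max }: IntervalTime): number {
  return min + Math.floor(Math.random() * (max - min))
}

const intervalTime: IntervalTime = { min: 2000, max: 3000 }
const delay = randomInterval(intervalTime)
```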

Example

Timed crawling: take the daily automatic capture of the cover images of Airbnb Plus listings as an example:

// 1. Import the module (ES/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // request timeout (ms)
  intervalTime: { max: 3000, min: 2000 } // crawl interval
})

// 3. Set the crawling task
/* 
  Call the startPolling API to start polling; 
  the callback function will be called once per day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const { jsdom } = await myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes')

  // Get the cover image elements for Plus listings
  const imgEls = jsdom.window.document
    .querySelector('.a1stauiv')
    ?.querySelectorAll('picture img')

  // Set the request configuration
  const requestConfig: string[] = []
  imgEls?.forEach((item) => requestConfig.push(item.src))

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })
})
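The callback above receives a run count and a stopPolling function, which lets the task end itself after a number of runs. A plain sketch of that pattern (independent of x-crawl, with a hypothetical runPolling helper written just for illustration) looks like:

```typescript
// A plain sketch of the polling pattern: the callback receives the current
// run count and a stop function. `runPolling` is a hypothetical helper for
// illustration; it is not part of x-crawl, and it runs synchronously
// instead of on a timer to keep the sketch simple.
type PollingCallback = (count: number, stopPolling: () => void) => void

function runPolling(maxRuns: number, callback: PollingCallback): number {
  let count = 0
  let stopped = false
  const stopPolling = () => {
    stopped = true
  }

  while (!stopped && count < maxRuns) {
    count++
    callback(count, stopPolling)
  }
  return count
}

// Stop after the third run, e.g. to crawl only three days' worth of pages.
const runs = runPolling(10, (count, stopPolling) => {
  if (count >= 3) stopPolling()
})
```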

Running result:

Note: do not crawl sites arbitrarily; check a site's robots.txt before crawling. This example only demonstrates how to use x-crawl.
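As a rough sketch of that check, one could test whether a path is disallowed for the wildcard user agent. This is a minimal, hypothetical parser for illustration only, not a complete implementation of the robots.txt spec:

```typescript
// Minimal sketch of a robots.txt check: collect the Disallow rules that
// apply to the wildcard user agent ("*") and test a path against them.
// A simplified illustration, not a complete robots.txt parser.
function isDisallowed(robotsTxt: string, path: string): boolean {
  const disallowed: string[] = []
  let appliesToUs = false

  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim() // strip comments
    if (line === '') continue
    const [field, ...rest] = line.split(':')
    const value = rest.join(':').trim()
    if (field.trim().toLowerCase() === 'user-agent') {
      appliesToUs = value === '*'
    } else if (
      appliesToUs &&
      field.trim().toLowerCase() === 'disallow' &&
      value !== ''
    ) {
      disallowed.push(value)
    }
  }
  // A path is disallowed if it starts with any matching rule prefix.
  return disallowed.some((rule) => path.startsWith(rule))
}

const robots = `User-agent: *\nDisallow: /private/\nDisallow: /tmp`
const blocked = isDisallowed(robots, '/private/page')
const allowed = isDisallowed(robots, '/public/page')
```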

More

For more detailed documentation, please see: github.com/coder-hxl/x-crawl