A flexible Node.js crawler library: x-crawl

x-crawl is a flexible Node.js crawler library. It can crawl pages, control page interactions, batch network requests, batch download file resources, poll and recrawl on a schedule, and more. It supports both asynchronous and synchronous crawling. Running on Node.js, it is flexible and simple to use, and friendly to JS/TS developers.

If you find it useful, you can give the x-crawl repository a Star to support it; your Star is the motivation for continued updates.

Features

  • Supports both asynchronous and synchronous data crawling.
  • Flexible usage: multiple ways to write request configurations and obtain crawling results.
  • Flexible crawling intervals: no interval, a fixed interval, or a random interval, so you decide whether to allow or avoid highly concurrent crawling.
  • With simple configuration it can crawl pages, batch network requests, batch download file resources, poll and recrawl, and more.
  • Crawls SPAs (single-page applications) to obtain pre-rendered content (i.e. "SSR", server-side rendering), parses the content with the jsdom library, and also supports parsing it yourself.
  • Supports form submission, keyboard input, event actions, screenshots of generated pages, etc.
  • Captures and records the success or failure of each crawl and highlights the results.
  • Written in TypeScript: ships type definitions and provides generics.
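The random-interval option above can be pictured with a small sketch: given an intervalTime of { min, max } like the one passed to xCrawl() later in this article, each request is delayed by a random value in that range. This is a simplified illustration of the idea, not x-crawl's internal implementation.

```typescript
// Simplified illustration of a random crawl interval. This is NOT
// x-crawl's internal code, just a sketch of the behavior its
// intervalTime: { min, max } option describes.
interface IntervalTime {
  min: number
  max: number
}

// Pick a random delay (in ms) in [min, max).
function randomInterval({ min, max }: IntervalTime): number {
  return min + Math.floor(Math.random() * (max - min))
}

const intervalTime: IntervalTime = { min: 2000, max: 3000 }
const delay = randomInterval(intervalTime)
```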

Example

Timed crawling: take the daily automatic capture of the cover images of Airbnb Plus listings as an example:

// 1. Import the module (ES/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // request timeout (ms)
  intervalTime: { max: 3000, min: 2000 } // crawl interval
})

// 3. Set the crawling task
/* 
  Call the startPolling API to start polling; 
  the callback function will be called once per day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call crawlPage API to crawl Page
  const { jsdom } = await myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes')

  // Get the cover image elements for Plus listings
  const imgEls = jsdom.window.document
    .querySelector('.a1stauiv')
    ?.querySelectorAll('picture img')

  // Set the request configuration
  const requestConfig: string[] = []
  imgEls?.forEach((item) => requestConfig.push(item.src))

  // Call the crawlFile API to crawl pictures
  myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })
})
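The callback above receives a run count and a stopPolling function, which lets the task end itself after a number of runs. A plain sketch of that pattern (independent of x-crawl, with a hypothetical runPolling helper written just for illustration) looks like:

```typescript
// A plain sketch of the polling pattern: the callback receives the current
// run count and a stop function. `runPolling` is a hypothetical helper for
// illustration; it is not part of x-crawl, and it runs synchronously
// instead of on a timer to keep the sketch simple.
type PollingCallback = (count: number, stopPolling: () => void) => void

function runPolling(maxRuns: number, callback: PollingCallback): number {
  let count = 0
  let stopped = false
  const stopPolling = () => {
    stopped = true
  }

  while (!stopped && count < maxRuns) {
    count++
    callback(count, stopPolling)
  }
  return count
}

// Stop after the third run, e.g. to crawl only three days' worth of pages.
const runs = runPolling(10, (count, stopPolling) => {
  if (count >= 3) stopPolling()
})
```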

Running result:

Note: do not crawl sites arbitrarily; check a site's robots.txt before crawling. This example only demonstrates how to use x-crawl.
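As a rough sketch of that check, one could test whether a path is disallowed for the wildcard user agent. This is a minimal, hypothetical parser for illustration only, not a complete implementation of the robots.txt spec:

```typescript
// Minimal sketch of a robots.txt check: collect the Disallow rules that
// apply to the wildcard user agent ("*") and test a path against them.
// A simplified illustration, not a complete robots.txt parser.
function isDisallowed(robotsTxt: string, path: string): boolean {
  const disallowed: string[] = []
  let appliesToUs = false

  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim() // strip comments
    if (line === '') continue
    const [field, ...rest] = line.split(':')
    const value = rest.join(':').trim()
    if (field.trim().toLowerCase() === 'user-agent') {
      appliesToUs = value === '*'
    } else if (
      appliesToUs &&
      field.trim().toLowerCase() === 'disallow' &&
      value !== ''
    ) {
      disallowed.push(value)
    }
  }
  // A path is disallowed if it starts with any matching rule prefix.
  return disallowed.some((rule) => path.startsWith(rule))
}

const robots = `User-agent: *\nDisallow: /private/\nDisallow: /tmp`
const blocked = isDisallowed(robots, '/private/page')
const allowed = isDisallowed(robots, '/public/page')
```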

More

For more detailed documentation, please see: github.com/coder-hxl/x-crawl