x-crawl
x-crawl is a flexible Node.js crawler library. It can crawl pages, control pages, make batch network requests, batch download file resources, poll and crawl on a schedule, and more. It supports crawling data in asynchronous or synchronous mode. Running on Node.js, it is flexible and simple to use, and friendly to JS/TS developers.
If you find x-crawl useful, you can give the repository a Star to support it; your Star will be the motivation for further updates.
Features
- Supports crawling data asynchronously or synchronously.
- Flexible usage: multiple ways to write request configurations and to obtain crawl results.
- Flexible crawl interval: no interval, a fixed interval, or a random interval, so you can decide whether to use or avoid high-concurrency crawling (see the sketch after this list).
- With simple configuration you can crawl pages, make batch network requests, batch download file resources, poll and crawl, and more.
- Crawls SPAs (single-page applications) to generate pre-rendered content (i.e. "SSR", server-side rendering), parses it with the jsdom library, and also supports parsing it yourself.
- Form submission, keyboard input, event actions, screenshots of generated pages, and more.
- Captures and records the success and failure of crawls, and highlights the reminders.
- Written in TypeScript; ships with types and provides generics.
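For example, the crawling mode and interval are chosen once, when the instance is created. A minimal sketch (the mode and intervalTime options follow the feature list above; the exact values and the spacedCrawler name are just illustrative):

```ts
import xCrawl from 'x-crawl'

// Crawl targets one at a time ('sync'), pausing a random
// 2-3 seconds between requests to avoid high concurrency.
const spacedCrawler = xCrawl({
  mode: 'sync', // 'async' or 'sync'
  intervalTime: { max: 3000, min: 2000 } // or a fixed number of milliseconds
})
```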
Example
Timed crawling: take automatically crawling the cover images of Airbnb Plus listings every day as an example:
```ts
// 1. Import the module (ES/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({
  timeout: 10000, // request timeout
  intervalTime: { max: 3000, min: 2000 } // crawl interval
})

// 3. Set the crawling task
/*
  Call the startPolling API to start the polling function,
  and the callback function will be called every day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call the crawlPage API to crawl the page
  const { jsdom } = await myXCrawl.crawlPage('https://zh.airbnb.com/s/*/plus_homes')

  // Get the cover image elements of the Plus listings
  const imgEls = jsdom.window.document
    .querySelector('.a1stauiv')
    ?.querySelectorAll('picture img')

  // Set the request configuration
  const requestConfig: string[] = []
  imgEls?.forEach((item) => requestConfig.push(item.src))

  // Call the crawlFile API to crawl the pictures
  myXCrawl.crawlFile({ requestConfig, fileConfig: { storeDir: './upload' } })
})
```
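The polling callback also receives count and stopPolling, as shown above. A minimal sketch of ending the task after a week of daily runs (the cut-off of 7 is purely illustrative, and count is assumed to start at 1):

```ts
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Stop the polling task after the 7th daily run.
  if (count >= 7) {
    stopPolling()
    return
  }
  // ... crawl as in the example above
})
```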
Note: Do not crawl arbitrarily. Check the target site's robots.txt protocol before crawling. This example is only meant to demonstrate how to use x-crawl.
More
For more detailed documentation, please see: github.com/coder-hxl/x-crawl