A comprehensive system for scraping and analyzing tender data from ezamowienia.gov.pl, with support for multiple scraping methods and AI-powered analysis.
- Node.js (v14 or higher)
- MongoDB
- Chrome/Chromium browser
- OpenAI API key (for analysis features)
- Clone the repository:
git clone <repository-url>
cd tender-scraper
- Install dependencies:
npm install
- Create a .env file:
MONGO_URL=mongodb://localhost:27017
MONGO_DB=tenders_db
OPENAI_API_KEY=your_openai_api_key
CHROME_PATH=/path/to/chrome # Optional
- Puppeteer Scraper (Default)
# Run with headless browser
node index.js normal --server
# Run with visible browser
node index.js normal --presentation
- XHR API Scraper
node index.js xhr
- Official API Scraper
node index.js api
- Details Processing
# Run with details processing
node index.js normal --with-details
# Run only details processing
node index.js normal --only-details
# Run details with visible browser
node index.js normal --only-details --presentation
- Correction Processing
# Run correction processor
node index.js normal --correction
In config.js:
puppeteer: {
  launch: {
    headless: true, // Set to false to show the browser window
    defaultViewport: null,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--window-size=1920,1080'
    ]
  }
}
MongoDB collections:
- tender_listings_{scraper_type}: Basic tender information
- tender_details: Detailed tender information
- tender_analysis: AI-processed analysis results
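The collection naming pattern and the MongoDB settings from the .env file can be tied together in a small helper. The following is a minimal sketch, not the project's actual code: the function names `listingsCollectionName` and `loadMongoSettings` are hypothetical, and it assumes that `{scraper_type}` takes the same values as the CLI modes (normal, xhr, api).

```javascript
// Hypothetical helpers; names and the mode-to-collection mapping are assumptions.
const SCRAPER_TYPES = ['normal', 'xhr', 'api'];

// Build the listings collection name for a given scraper type.
function listingsCollectionName(scraperType) {
  if (!SCRAPER_TYPES.includes(scraperType)) {
    throw new Error(`Unknown scraper type: ${scraperType}`);
  }
  return `tender_listings_${scraperType}`;
}

// Read the MongoDB settings from an environment-like object and
// fail early if a required variable is missing.
function loadMongoSettings(env = process.env) {
  const missing = ['MONGO_URL', 'MONGO_DB'].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return { url: env.MONGO_URL, dbName: env.MONGO_DB };
}
```

Failing at startup on a missing variable is usually easier to diagnose than a connection error that surfaces in the middle of a scraping run.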
sequenceDiagram
participant CLI as Command Line
participant App as Main Application
participant PS as Puppeteer Scraper
participant AS as API Scraper
participant OAS as Official API Scraper
participant DS as Details Scraper
participant CP as Correction Processor
participant DB as MongoDB
participant GPT as OpenAI GPT
CLI->>App: Start Command (normal/xhr/api)
alt normal mode
App->>PS: Start Scraping
PS->>DB: Save Listings
else xhr mode
App->>AS: Start Scraping
AS->>DB: Save Listings
else api mode
App->>OAS: Start Scraping
OAS->>DB: Save Listings
end
alt --with-details flag
App->>DS: Start Details Processing
DS->>DB: Get Unprocessed Listings
loop For each listing
DS->>DS: Initialize Browser
DS->>DS: Setup UI
DS->>GPT: Analyze Content
DS->>DB: Save Analysis
DS->>DS: Cleanup Browser
end
end
alt --correction flag
App->>CP: Start Correction
CP->>DB: Get Tender Details
CP->>GPT: Analyze Details
CP->>DB: Save Corrected Data
end
App->>CLI: Process Complete
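The flow in the diagram can be read as a mapping from command-line arguments to an ordered list of steps. The sketch below illustrates that mapping under stated assumptions: `parseCli` and `planSteps` are hypothetical names, the browser is assumed to be headless unless `--presentation` is passed, and `--only-details` is assumed to skip the listing scrape. The real index.js may behave differently.

```javascript
// Hypothetical sketch of the dispatch shown in the diagram; not the project's code.
const MODES = ['normal', 'xhr', 'api'];

// Turn the arguments after "node index.js" into an options object.
function parseCli(argv) {
  const [mode = 'normal', ...flags] = argv;
  if (!MODES.includes(mode)) {
    throw new Error(`Unknown mode: ${mode}`);
  }
  const has = (flag) => flags.includes(flag);
  return {
    mode,
    headless: !has('--presentation'), // assumption: headless is the default
    withDetails: has('--with-details'),
    onlyDetails: has('--only-details'),
    correction: has('--correction'),
  };
}

// List the steps in the order the diagram shows them.
function planSteps(options) {
  const steps = [];
  if (!options.onlyDetails) steps.push(`scrape:${options.mode}`);
  if (options.withDetails || options.onlyDetails) steps.push('details');
  if (options.correction) steps.push('correction');
  return steps;
}
```

In a script this would be called as `planSteps(parseCli(process.argv.slice(2)))`, since the first two entries of `process.argv` are the Node binary and the script path.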
- Multiple scraping methods (Puppeteer, XHR, Official API)
- AI-powered tender analysis using OpenAI GPT
- Automatic browser user-agent rotation
- Detailed logging system
- Visual progress tracking for details processing
- Error recovery and retry mechanisms
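User-agent rotation, listed above, typically means picking a different user-agent string for each browser session. The following is a minimal sketch of how a helper such as `getRandomUserAgent` might work; the list of strings is illustrative and the implementation is an assumption, not the project's actual code.

```javascript
// Illustrative pool of user-agent strings (assumed, not taken from the project).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

// Pick one entry uniformly at random. The random source is injectable
// so the function can be tested deterministically.
function getRandomUserAgent(random = Math.random) {
  const index = Math.floor(random() * USER_AGENTS.length);
  return USER_AGENTS[index];
}
```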
Logs are stored in the logs directory:
- combined.log: All log levels
- error.log: Error-level logs only
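The split between the two files can be illustrated with a small logger that uses only the Node.js standard library. This is a sketch under assumptions: the project may well use a logging library instead, and `createLogger` and the JSON line format are hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical logger: every entry goes to combined.log,
// and error-level entries are also written to error.log.
function createLogger(logDir) {
  fs.mkdirSync(logDir, { recursive: true });
  const write = (file, line) => {
    fs.appendFileSync(path.join(logDir, file), `${line}\n`);
  };
  return {
    log(level, message) {
      const line = JSON.stringify({
        level,
        message,
        timestamp: new Date().toISOString(),
      });
      write('combined.log', line);
      if (level === 'error') write('error.log', line);
    },
  };
}
```

Calling `createLogger('logs').log('error', 'Scrape failed')` would append one line to each of the two files, while an `info` entry would appear in combined.log only.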
The system includes:
- Automatic retry mechanisms
- Browser session recovery
- Connection error handling
- Process cleanup on errors
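A retry mechanism of the kind listed above is often built as a wrapper that re-runs a failing task after a growing delay. The sketch below shows one common approach, exponential backoff; `withRetry`, `backoffDelays`, and the default values are hypothetical, and the project's own retry logic may differ.

```javascript
// Delays before each retry: baseDelayMs, then doubled each time.
function backoffDelays(retries, baseDelayMs) {
  return Array.from({ length: retries }, (_, i) => baseDelayMs * 2 ** i);
}

// Run an async task, retrying up to `retries` times after a failure.
// The last error is rethrown once all attempts are used up.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  const delays = backoffDelays(retries, baseDelayMs);
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === retries) break;
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
  throw lastError;
}
```

A page load could then be wrapped as `await withRetry(() => page.goto(url))`, where `page` and `url` stand for whatever Puppeteer page and address the caller already has.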
To modify the system:
- Browser Configuration
// Modify browser settings in initBrowser()
async initBrowser() {
  this.browser = await puppeteer.launch({
    headless: config.puppeteer.launch.headless,
    defaultViewport: { width: 1920, height: 1080 },
    // ... other settings
  });
}
- User Agent Rotation
// Use random user agent
await this.page.setUserAgent(getRandomUserAgent());
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License.