
Legacy Frontend Scraper for CRUD Generator

A Node.js web scraping tool that bridges PHPReaction legacy frontend applications and modern CRUD v2 implementations. The scraper extracts UI structure, field definitions, and metadata from legacy systems to enable migration while keeping generated CRUD configurations consistent with the legacy UI/UX.


Legacy Migration Architecture

This tool forms a critical component in the PHPReaction ecosystem migration strategy, specifically designed to preserve years of UI/UX refinements while modernizing the underlying architecture. By automating the extraction of legacy frontend patterns, it ensures that the transition to CRUD v2 maintains user familiarity and business logic consistency.

Key Features

  • Automated Web Scraping: Extracts page structures, UI elements, and features automatically
  • Multi-language Support: Handles English and French content extraction
  • Session Management: Login handling with retry mechanisms
  • Structured Data Extraction: Panel navigation, listing pages, detail pages, and action detection
  • Error Handling: Built-in retry mechanisms and graceful error recovery
  • Configurable Output: Organized data structure based on language preference

Legacy-to-Modern Integration Pipeline

The scraper operates as the first stage in a three-tier integration pipeline:

  1. Legacy Data Extraction -> CRUD Generator Enhancement -> CRUD v2 Configuration
  2. Scraper Output (scrapedData/) -> Generator Processing (legacy-integration.js) -> Final Config (crud_config/)
  3. UI Preservation -> Metadata Synchronization -> Generated Components

Critical Integration Points:

  • Generator Integration: The legacy-integration.js module in the CRUD generator processes scraped data
  • Field Synchronization: Legacy field order and labels are preserved through processFieldsWithLegacy()
  • Metadata Matching: Entity metadata from API calls is enhanced with scraped UI patterns
  • Configuration Generation: Final CRUD configs maintain legacy UI structure while using modern v2 architecture

Architecture & Technology Stack

Core Dependencies & Technical Stack

{
  "puppeteer-core": "^24.1.0",  // Chrome DevTools Protocol automation
  "dotenv": "^16.4.7",          // Environment configuration
  "meow": "^13.2.0",            // CLI argument parsing
  "cli-meow-help": "^4.0.0"     // CLI help generation
}

Project Structure

phpreaction-legacy-frontend-scraper-for-generator/
├── index.js                 # Entry point and main orchestration
├── login.js                 # Authentication module
├── scraper.js               # Core scraping logic and WebScraper class
├── .env                     # Environment configuration
├── utils/
│   ├── cli.js               # CLI argument parsing
│   └── functions.js         # Utility functions
├── scrapedData/             # Output directory
│   ├── en/                  # English extracted data
│   │   └── [PanelName]Bundle/
│   │       └── [MenuName]/
│   │           └── index.js # Contains listing_data and show_data
│   └── fr/                  # French extracted data
│       └── [PanelName]Bundle/
│           └── [MenuName]/
│               └── index.js
├── package.json
├── README.md
└── CHANGELOG.md

WebScraper Class Architecture & Data Flow

The WebScraper class implements the extraction pipeline:

class WebScraper {
  constructor(baseUrl, options = {}) {
    // Core browser automation setup
    this.baseUrl = baseUrl; // Target legacy frontend URL
    this.browser = null;    // Puppeteer browser instance
    this.page = null;       // Active browser page

    // Extraction configuration
    this.options = {
      outputDir: "./scrapedData/[language]", // Multi-language output
      maxRetries: 3,     // Resilience handling
      retryDelay: 1000,  // Retry backoff (ms)
      timeout: 500000,   // Operation timeout (ms)
      headless: false,   // Debug visibility
    };
  }

  // Key extraction methods:
  // - processAllPanels():      Navigates accordion menu structure
  // - extractListingData():    Captures table headers and data patterns
  // - extractShowPageData():   Extracts detail view field structure
  // - extractListingFilters(): Captures filter configurations
  // - extractFormElements():   Scrapes form field definitions
}

Installation & Setup

Prerequisites

Ensure you have the following installed:

  • Node.js (Latest LTS version recommended)
  • Chrome/Chromium browser for Puppeteer
  • Network Access to the target legacy frontend

Installation Steps

Clone Repository

git clone https://github.com/PHPCreation/phpreaction-legacy-frontend-scraper-for-generator.git
cd phpreaction-legacy-frontend-scraper-for-generator

Install Dependencies

npm install

Environment Configuration

Create or configure your .env file:

# Target URL for scraping
SCRAPE_URL=https://your-legacy-frontend.example.com

# Authentication credentials
AUTH_USERNAME=your-username
AUTH_PASSWORD=your-password

# Chrome executable path (adjust based on your system)
CHROME_EXECUTABLE_PATH=/path/to/chrome

# Language preference (en/fr)
SCRAPER_LANGUAGE=en
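The scraper reads these values via dotenv at startup. As a sketch (the variable names follow the .env example above; the `validateEnv` helper itself is hypothetical, not part of the project), a startup check can fail fast with a clear message when a required setting is missing:

```javascript
// Hypothetical startup check: verify required settings before launching
// the browser, so a missing credential fails fast instead of mid-scrape.
const REQUIRED_VARS = [
  "SCRAPE_URL",
  "AUTH_USERNAME",
  "AUTH_PASSWORD",
  "CHROME_EXECUTABLE_PATH",
];

function validateEnv(env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required .env settings: ${missing.join(", ")}`);
  }
  // SCRAPER_LANGUAGE is optional and defaults to English
  return { ...env, SCRAPER_LANGUAGE: env.SCRAPER_LANGUAGE || "en" };
}
```

Called as `validateEnv(process.env)` before the browser launch, this turns a silent hang on a missing credential into an immediate, descriptive error.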

Verify Setup

npm run help

Usage & Operation

Basic Commands

# Run the complete scraping process
npm start   # Equivalent to: node index.js

CLI Options

The scraper supports various command-line options:

# Language selection
node index.js --language fr            # French
node index.js --language en            # English (default)

# Processing modes
node index.js --parameters             # Parameters only
node index.js --bundle Name --url URL  # Specific bundle
node index.js --url URL                # Single URL

# Help
node index.js --help

Configuration Options

// Scraper configuration options in scraper.js
{
  outputDir: "./scrapedData/[language]", // Output directory
  maxRetries: 3,     // Retry attempts
  retryDelay: 1000,  // Delay between retries (ms)
  timeout: 500000,   // Overall timeout (ms)
  headless: false    // Browser visibility
}

Data Extraction Architecture & Methodology

1. Authentication & Session Management

The Login class provides robust authentication for legacy systems:

// login.js - Authentication against the legacy frontend
class Login {
  async login(username, password) {
    // Handles legacy CSRF tokens
    // Manages session persistence
    // Implements retry logic for network failures
    // Supports multi-language login interfaces
  }
}

2. Hierarchical Panel Processing

The scraper navigates the accordion menu hierarchy:

// Core extraction workflow
processAllPanels() {
  // 1. Detect accordion menu structure (#accordion-menu)
  // 2. Extract panel information (name, icon, state)
  // 3. Navigate through collapsed/expanded states
  // 4. Process menu items within each panel
  // 5. Maintain hierarchical data structure
  // Output: ./scrapedData/[language]/[PanelName]Bundle/[MenuName]/
}

3. Multi-Dimensional Data Extraction

Legacy UI Pattern Recognition

The scraper identifies and preserves critical UI patterns:

// Listing page extraction (table.styled-list)
const listingData = [
  "Id",      // Column headers in display order
  "Account",
  "Ref",
  "Type",    // Preserves legacy field ordering
  "Parent",
];

// Filter extraction (#panel-filters)
const listingFilters = [
  {
    id: "tag",
    label: "Tags",             // Original legacy labels
    filterType: "multiSelect",
    inputType: "text",         // UI component type
    isMultiSelect: true,
  },
];

// Detail view extraction (.show #show-fields)
const showData = [
  {
    title: "Basic Information",           // Section groupings
    fields: ["fieldName1", "fieldName2"], // Field organization
  },
];

Advanced Component Recognition

  • Sidebox Recognition: Captures related entity panels and relationships
  • Action Button Mapping: Identifies view/edit/delete action patterns
  • Form Field Extraction: Preserves input types, validations, and structure
  • Multi-language Content: Maintains translations and locale-specific configurations
  • Custom UI Elements: Recognizes specialized components and their configurations
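The action-button mapping above can be sketched as a pure classifier over the class names the scraper finds on a button. This is an illustration only: the specific class names below are assumptions, not the exact legacy markup.

```javascript
// Illustrative sketch: map a legacy action button's class attribute to a
// normalized action name. The class-name table is an assumption.
const ACTION_CLASSES = {
  "btn-view": "view",
  "btn-edit": "edit",
  "btn-delete": "delete",
};

function classifyActionButton(classAttr) {
  const classes = classAttr.split(/\s+/).filter(Boolean);
  for (const cls of classes) {
    if (ACTION_CLASSES[cls]) return ACTION_CLASSES[cls];
  }
  return null; // Unknown buttons are ignored rather than guessed at
}
```

Keeping the mapping as a lookup table makes it easy to extend when a new legacy button variant turns up during scraping.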

4. Structured Data Output & Generator Integration

The scraper produces structured data files that directly integrate with the CRUD generator’s legacy processing system:

// Generated index.js files in the scrapedData structure
export const listingData = [
  "Id",      // Field order preserved from legacy UI
  "Account",
  "Ref",
  "Type",
];

export const listingFilters = [
  {
    id: "tag",
    label: "Tags",             // Original UI labels
    filterType: "multiSelect", // Component type mapping
    inputType: "text",
    isMultiSelect: true,
  },
];

export const showData = [
  {
    title: "Basic Information", // Section organization
    fields: ["fieldName1"],     // Field structure
  },
];

export const formData = {
  // Form field configurations
  groupName: [
    {
      label: "Field Label",
      type: "input_type", // Legacy input patterns
    },
  ],
};

export const sideboxData = [
  {
    // Related entity panels
    title: "Related Items",
    icon: "glyphicons-list",
    panelId: "#related-panel",
  },
];
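Because every export is a plain array or object literal, writing these files can be as simple as serializing each extracted value. The following writer is a hypothetical sketch of that step (the helper name is not from the project):

```javascript
// Hypothetical writer: serialize extracted data into the exported-constant
// module format shown above. JSON.stringify suffices because the scraped
// values are plain arrays and objects with no functions or dates.
function renderLegacyModule(data) {
  return Object.entries(data)
    .map(
      ([name, value]) =>
        `export const ${name} = ${JSON.stringify(value, null, 2)};`
    )
    .join("\n\n");
}
```

The resulting text is written to `index.js` in the entity's output folder, ready for the generator to parse back out.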

Generator Integration Process:

// In the CRUD generator's legacy-integration.js
const legacyData = findLegacyData(entityName);
const syncedFields = processFieldsWithLegacy(
  fields,
  legacyData.listingData,
  "listing"
);
// Result: Metadata fields reordered to match legacy UI patterns

Complete Integration Workflow & Data Flow

1. Three-Tier Legacy Migration Pipeline

Stage 1: Legacy Data Extraction (Scraper)

Input:   PHPReaction legacy frontend URL + credentials
Process: Puppeteer-based navigation and DOM extraction
Output:  scrapedData/[lang]/[Bundle]/[Entity]/index.js

// Scraper extracts:
// - Panel structure (#accordion-menu)
// - Listing headers (table.styled-list thead)
// - Filter configurations (#panel-filters)
// - Detail view sections (.show #show-fields)
// - Form field patterns (edit pages)
// - Action button mappings (.btn-action)

Stage 2: Metadata Enhancement (Generator)

Input:   API metadata + scraped legacy data
Process: Field synchronization via legacy-integration.js
Output:  Enhanced entity configurations

// Generator processes:
const legacyData = findLegacyData("AccountingBundle\\\\Account");
const orderedFields = syncFields(apiFields, legacyData.listingData, "listing");
// Result: API fields reordered to match legacy UI sequence

Stage 3: CRUD v2 Configuration (Final Output)

Input:   Enhanced configurations
Process: Template-based generation
Output:  crud_config/[entity]/index.tsx

// Final generated config maintains legacy order:
const mainColumns: MainColumnsListing[] = [
  { title: "Id", key: "id", sortable: true }, // Legacy order preserved
  { title: "Account", key: "account", sortable: true },
  { title: "Ref", key: "ref", sortable: true },
];

2. Data Synchronization Mechanisms

The integration relies on field-name matching:

// Field normalization for matching
function normalizeName(name) {
  return name
    .toLowerCase()
    .replace(/s?bundle$/, "")                // Remove bundle suffix
    .replace(/\s+code$/i, "code")            // Normalize code fields
    .replace(/^(app|phpreaction)entity/, "") // Remove entity prefixes
    .replace(/\s+/g, "");                    // Remove whitespace
}

// Synchronization process
function syncFields(fields, legacyFields, type) {
  // 1. Create normalized name mappings
  // 2. Match API fields to legacy field order
  // 3. Preserve legacy labels and titles
  // 4. Maintain original UI structure
  return orderedFields; // Fields in legacy UI order
}
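Filling in the pseudocode, one plausible implementation of the ordering step looks like this. It is a sketch, not the generator's actual code; `normalizeName` is repeated from above so the snippet is self-contained.

```javascript
function normalizeName(name) {
  return name
    .toLowerCase()
    .replace(/s?bundle$/, "")
    .replace(/\s+code$/i, "code")
    .replace(/^(app|phpreaction)entity/, "")
    .replace(/\s+/g, "");
}

// Sketch of syncFields: order API field names to match the legacy UI,
// appending any fields the legacy UI never displayed.
function syncFields(fields, legacyFields) {
  const position = new Map(
    legacyFields.map((name, index) => [normalizeName(name), index])
  );
  const matched = fields
    .filter((f) => position.has(normalizeName(f)))
    .sort(
      (a, b) => position.get(normalizeName(a)) - position.get(normalizeName(b))
    );
  const unmatched = fields.filter((f) => !position.has(normalizeName(f)));
  return [...matched, ...unmatched];
}
```

For example, `syncFields(["type", "id", "account", "internalFlag"], ["Id", "Account", "Type"])` yields `["id", "account", "type", "internalFlag"]`: the three legacy-visible fields take the legacy order, and the unmatched field trails at the end.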

3. Multi-Language Support Integration

Handles legacy systems with multiple language interfaces:

# Scraper output structure
scrapedData/
├── en/                   # English interface extraction
│   └── AccountingBundle/
│       └── Accounts/
│           └── index.js  # English field labels
└── fr/                   # French interface extraction
    └── AccountingBundle/
        └── Accounts/
            └── index.js  # French field labels

Legacy Preservation Strategy

The integration workflow ensures that:

  • Field Order: Maintained from legacy listing tables
  • Label Consistency: Original UI text preserved
  • Filter Logic: Legacy filter patterns mapped to v2 components
  • Section Organization: Detail view groupings maintained
  • Action Patterns: Button configurations transferred

Legacy Data Structure & Generator Integration

Hierarchical Output Organization

The scraper organizes extracted data to mirror legacy frontend structure:

scrapedData/
├── en/                       # English UI extraction
│   ├── AccountingBundle/     # Legacy panel grouping
│   │   ├── Accounts/
│   │   │   └── index.js      # Account entity data
│   │   ├── Entries/
│   │   │   └── index.js      # Entries entity data
│   │   └── Transactions/
│   │       └── index.js
│   ├── BillsBundle/          # Billing module
│   │   ├── Bills/
│   │   └── Payments/
│   ├── entityList/           # Cross-bundle entity registry
│   │   └── EntityList.js     # All discovered entities
│   └── schema.js             # Complete menu structure
└── fr/                       # French UI extraction
    └── [Identical structure with French labels]

Generator Integration Points

The CRUD generator locates and processes scraped data through multiple strategies:

// Generator's legacy-integration.js
function findLegacyData(entityName) {
  // 1. Parse entity name: 'AccountingBundle\\\\Account'
  const entity = "Account";          // Extract entity name
  const bundle = "AccountingBundle"; // Extract bundle name

  // 2. Normalize for file system matching
  const normalizedEntity = normalizeName(entity); // 'account'
  const normalizedBundle = normalizeName(bundle); // 'accounting'

  // 3. Search scraped data structure
  const bundleFolder = "./scrapedLegacy/AccountingBundle";
  const entityFolder = findEntityFolder(bundleFolder, normalizedEntity);

  // 4. Load and parse legacy data
  const legacyContent = fs.readFileSync(entityFolder + "/index.js");
  return parseLegacyData(legacyContent); // Extract exported constants
}
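The final step, `parseLegacyData`, has to recover the exported constants from the scraped module text. A sketch of one way to do that (assumed behavior, not the generator's actual implementation): since the files contain only plain literals, stripping the `export` keywords and evaluating the declarations in a function scope is enough for illustration purposes.

```javascript
// Sketch: recover exported constants from a scraped index.js file.
// Assumes the file contains only `export const name = <literal>;`
// declarations, as in the data-format examples in this document.
function parseLegacyData(source) {
  const names = [...source.matchAll(/export\s+const\s+(\w+)/g)].map(
    (m) => m[1]
  );
  const body = source.replace(/export\s+const\s+/g, "var ");
  // Evaluate the declarations, then return them as one object.
  return new Function(`${body}\nreturn { ${names.join(", ")} };`)();
}
```

A real implementation would want a proper parser or dynamic `import()` rather than `new Function`, since evaluating file contents directly is only safe for data you generated yourself.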

Actual Scraped Data Format (Generator-Compatible)

Each index.js contains precisely structured exports that the generator’s legacy-integration.js parses:

// Real scraped data format from the legacy frontend
export const listingData = [
  "Id",      // Simple array of column headers
  "Account", // Extracted from table.styled-list thead th
  "Ref",
  "Type",
  "Parent",  // Order preserved from legacy UI
];

export const listingFilters = [
  {
    id: "tag",                 // Filter field identifier
    label: "Tags",             // Original legacy label
    filterType: "multiSelect", // UI component type
    inputType: "text",         // Input method
    isMultiSelect: true,       // Selection behavior
    placeholder: null,         // UI placeholder
    dataFetchUrl: null,        // Dynamic data source
  },
  {
    id: "disabledAt",
    label: "Enabled",          // Legacy terminology preserved
    filterType: "select",
    inputType: "text",
    isMultiSelect: false,
  },
];

export const showData = [
  {
    title: "Basic Information", // Section from .show #show-fields h2
    fields: [
      // Fields from .field-name elements
      "fieldName1",
      "fieldName2",
    ],
  },
  {
    title: "Advanced Settings", // Multiple sections preserved
    fields: ["advancedField1", "advancedField2"],
  },
];

export const formData = {
  // Form structure from edit pages
  group1: [
    {
      label: "Account Type", // Original form labels
      type: "select",        // Input type detection
    },
  ],
};

export const sideboxData = [
  {
    // Related entity panels
    title: "Related Transactions", // Panel titles
    icon: "glyphicons-list",       // Icon classes
    panelId: "#related-panel",     // DOM references
  },
];

Generator Processing:

// How the generator uses this data
function processFieldsWithLegacy(entityName, fields, type) {
  const legacyData = findLegacyData(entityName);
  switch (type) {
    case "listing":
      // Uses legacyData.listingData array to reorder API fields
      return syncFields(fields, legacyData.listingData, "listing");
    case "form":
      // Uses legacyData.formData object for form field organization
      return syncFields(
        fields,
        Object.values(legacyData.formData).flat(),
        "form"
      );
    case "show": {
      // Uses legacyData.showData sections for detail view structure
      const legacyFields = legacyData.showData.flatMap(
        (section) => section.fields
      );
      return syncFields(fields, legacyFields, "show");
    }
  }
}

Error Handling & Troubleshooting

Common Issues & Solutions

1. Chrome Path Issues

Problem: Chrome executable not found

Solutions:

  • Verify Chrome installation: which google-chrome or where chrome.exe
  • Update CHROME_EXECUTABLE_PATH in .env
  • Install Chrome/Chromium if missing

2. Authentication Failures

# Common authentication issues:
# 1. Invalid credentials
# 2. Network connectivity issues
# 3. CSRF token problems
# 4. Session timeout

# Solutions:
# - Verify credentials in .env
# - Check network connectivity
# - Clear browser cache/cookies
# - Check for CSRF error messages in PHPR listing pages

3. Language Switch Failures

  • Issue: Language switching not working
  • Solution: Verify supported languages in target application
  • Check: Language parameter in CLI arguments

4. Scraping Timeouts

// Adjust timeout settings in scraper.js
{
  timeout: 500000,  // Overall timeout (increase if needed)
  maxRetries: 3,    // Retry attempts
  retryDelay: 1000  // Delay between retries (ms)
}

Debugging Options

# Enable verbose logging
node index.js --verbose

# Run with the browser visible for debugging
# (set headless: false in scraper.js options)

# Test a specific URL
node index.js --url "https://example.com/debug-page"

Development & Maintenance

Development Workflow

The project follows standard Node.js development practices:

{
  "scripts": {
    "start": "node index.js",
    "help": "node index.js --help",
    "start:parameters": "node index.js --parameters",
    "prepare": "husky install",
    "commit": "git-cz"
  }
}

Production Guidelines & Legacy Migration Strategy

Production Scraping Guidelines

  1. Legacy System Access: Coordinate with system administrators for stable access windows
  2. Multi-Language Extraction: Run separate scraping sessions for each locale (en/fr)
  3. Data Validation Pipeline: Implement automated validation of scraped data completeness
  4. Version Control Integration: Commit scraped data changes to track legacy system evolution
  5. Generator Synchronization: Ensure scraped data format matches generator expectations
  6. Incremental Updates: Re-scrape only changed entities rather than full system scraping

Performance Optimization

// Optimized scraper configuration
{
  headless: true,    // For production runs
  timeout: 300000,   // Reasonable timeout
  maxRetries: 2,     // Avoid excessive retries
  concurrent: false  // Avoid overwhelming the target server
}

Security Considerations

  • Credential Management: Use environment variables for sensitive data
  • Rate Limiting: Implement delays between requests
  • Error Handling: Don’t expose sensitive information in logs
  • Network Security: Use HTTPS connections where possible

Real-World Integration Examples

Complete Legacy Migration Workflow

# Step 1: Extract the legacy frontend structure
cd phpreaction-legacy-frontend-scraper-for-generator
SCRAPER_LANGUAGE=en npm start
# Output: scrapedData/en/[Bundle]/[Entity]/index.js files

# Step 2: Copy scraped data to the generator
cp -r scrapedData/* ../frontend-components-crud-react-generator/scrapedLegacy/

# Step 3: Generate a CRUD config with legacy integration
cd ../frontend-components-crud-react-generator
node index.js --entityName "AccountingBundle\\\\Account" \
  --bundleCrud "phprCrud" \
  --resourceName "accounts" \
  --outputPath "./generated/accounts/index.tsx"

# Step 4: Deploy to the CRUD v2 project
cp generated/accounts/index.tsx ../phpreaction-frontend-crud-react-v2/src/crud_config/accounts/

Generator Integration Deep Dive

How the generator processes scraped legacy data:

// In ticket-process.js - generateConfiguration()
const originalListingFields = readJsonFile(listingFieldsPath); // From API metadata
const allFields = readJsonFile(fieldsPath);                    // From API metadata

// Legacy integration occurs here:
let listingLegacyFields = processFieldsWithLegacy(
  entityName,    // 'AccountingBundle\\\\Account'
  listingFields, // API-provided fields
  "listing"      // Processing type
);
// Result: API fields reordered to match legacy UI sequence
// Example: ['id', 'name', 'type'] becomes ['name', 'type', 'id']
// if the legacy UI shows that order

// Final config generation
AddMainColumns(filteredAllFields, outputPath); // Uses reordered fields
AddFormInputs(entityName, allFields, requiredFields, outputPath);
AddFilterFields(entityName, requiredSelectedFields, resourceName, outputPath);

CRUD v2 Generated Configuration

Final output maintains legacy patterns:

// Generated crud_config/accounts/index.tsx
const mainColumns: MainColumnsListing[] = [
  {
    title: "Account", // Legacy label preserved
    key: "account",   // API field mapped
    sortable: true,
  },
  {
    title: "Ref",     // Legacy order maintained
    key: "ref",
    sortable: true,
  },
  {
    title: "Type",    // Follows legacy sequence
    key: "type",
    sortable: true,
  },
];
// Order matches scraped listingData: ["Account", "Ref", "Type"]
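The column entries above follow a mechanical pattern: the scraped header becomes the `title`, and a camel-cased form of it becomes the `key`. A sketch of that mapping (the camel-casing rule is an assumption for illustration; the real generator maps keys from API metadata):

```javascript
// Illustrative mapping from a scraped legacy header to a CRUD v2
// column entry. The camel-casing rule here is an assumption.
function headerToColumn(header) {
  const words = header.trim().split(/\s+/);
  const key = words
    .map((w, i) =>
      i === 0 ? w.toLowerCase() : w[0].toUpperCase() + w.slice(1).toLowerCase()
    )
    .join("");
  return { title: header, key, sortable: true };
}
```

Applied to the scraped `listingData` array, `["Account", "Ref", "Type"].map(headerToColumn)` reproduces the shape of the generated `mainColumns` shown above.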

Support & Resources



This documentation covers the legacy frontend scraper’s role in the PHPReaction migration ecosystem: its data extraction mechanisms, its integration with the CRUD generator’s legacy processing pipeline, and the complete workflow from legacy UI preservation to CRUD v2 configuration generation.
