Legacy Frontend Scraper for CRUD Generator
A sophisticated Node.js-based web scraping tool that bridges the gap between PHPReaction legacy frontend applications and modern CRUD v2 implementations. This scraper extracts UI structure, field definitions, and metadata from legacy systems to enable seamless migration and maintain UI/UX consistency in generated CRUD configurations.
Legacy Migration Architecture
This tool forms a critical component in the PHPReaction ecosystem migration strategy, specifically designed to preserve years of UI/UX refinements while modernizing the underlying architecture. By automating the extraction of legacy frontend patterns, it ensures that the transition to CRUD v2 maintains user familiarity and business logic consistency.
Key Features
- Automated Web Scraping: Extracts page structures, UI elements, and features automatically
- Multi-language Support: Handles English and French content extraction
- Session Management: Login handling with retry mechanisms
- Structured Data Extraction: Panel navigation, listing pages, detail pages, and action detection
- Error Handling: Built-in retry mechanisms and graceful error recovery
- Configurable Output: Organized data structure based on language preference
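The language-keyed output layout described above can be sketched as a small path helper. This is a hypothetical illustration of the convention (the scraper derives these paths internally):

```javascript
// Hypothetical helper illustrating the language-keyed output layout;
// the real scraper computes these paths internally.
const SUPPORTED_LANGUAGES = ["en", "fr"];

function buildOutputPath(language, panelName, menuName) {
  if (!SUPPORTED_LANGUAGES.includes(language)) {
    throw new Error(`Unsupported language: ${language}`);
  }
  // Mirrors scrapedData/[language]/[PanelName]Bundle/[MenuName]/index.js
  return ["scrapedData", language, `${panelName}Bundle`, menuName, "index.js"].join("/");
}
```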
Legacy-to-Modern Integration Pipeline
The scraper operates as the first stage in a three-tier integration pipeline:
- Legacy Data Extraction -> CRUD Generator Enhancement -> CRUD v2 Configuration
- Scraper Output (scrapedData/) -> Generator Processing (legacy-integration.js) -> Final Config (crud_config/)
- UI Preservation -> Metadata Synchronization -> Generated Components
Critical Integration Points:
- Generator Integration: The legacy-integration.js module in the CRUD generator processes scraped data
- Field Synchronization: Legacy field order and labels are preserved through processFieldsWithLegacy()
- Metadata Matching: Entity metadata from API calls is enhanced with scraped UI patterns
- Configuration Generation: Final CRUD configs maintain legacy UI structure while using modern v2 architecture
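The field-synchronization step can be sketched as follows. This is a simplified, hypothetical version of what processFieldsWithLegacy() does, not the generator's actual code:

```javascript
// Simplified sketch of the field-synchronization idea: API fields are
// reordered to match the column order scraped from the legacy listing
// table, and fields the legacy UI never showed are kept at the end.
function reorderByLegacy(apiFields, legacyHeaders) {
  const normalize = (s) => s.toLowerCase().replace(/\s+/g, "");
  const rank = new Map(legacyHeaders.map((h, i) => [normalize(h), i]));
  const position = (field) =>
    rank.has(normalize(field.title))
      ? rank.get(normalize(field.title))
      : Number.MAX_SAFE_INTEGER; // unknown fields sort last, order preserved
  return [...apiFields].sort((a, b) => position(a) - position(b));
}
```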
Architecture & Technology Stack
Core Dependencies & Technical Stack
{
"puppeteer-core": "^24.1.0", // Chrome DevTools Protocol automation
"dotenv": "^16.4.7", // Environment configuration
"meow": "^13.2.0", // CLI argument parsing
"cli-meow-help": "^4.0.0" // CLI help generation
}
Project Structure
phpreaction-legacy-frontend-scraper-for-generator/
├── index.js # Entry point and main orchestration
├── login.js # Authentication module
├── scraper.js # Core scraping logic and WebScraper class
├── .env # Environment configuration
├── utils/
│ ├── cli.js # CLI argument parsing
│ └── functions.js # Utility functions
├── scrapedData/ # Output directory
│ ├── en/ # English extracted data
│ │ └── [PanelName]Bundle/
│ │ └── [MenuName]/
│ │ └── index.js # Contains listing_data and show_data
│ └── fr/ # French extracted data
│ └── [PanelName]Bundle/
│ └── [MenuName]/
│ └── index.js
├── package.json
├── README.md
└── CHANGELOG.md
WebScraper Class Architecture & Data Flow
The WebScraper class implements a sophisticated extraction pipeline:
class WebScraper {
constructor(baseUrl, options = {}) {
// Core browser automation setup
this.baseUrl = baseUrl; // Target legacy frontend URL
this.browser = null; // Puppeteer browser instance
this.page = null; // Active browser page
// Extraction configuration
this.options = {
outputDir: "./scrapedData/[language]", // Multi-language output
maxRetries: 3, // Resilience handling
retryDelay: 1000, // Retry backoff
timeout: 500000, // Operation timeout
headless: false, // Debug visibility
};
}
// Key extraction methods:
// - processAllPanels(): Navigates accordion menu structure
// - extractListingData(): Captures table headers and data patterns
// - extractShowPageData(): Extracts detail view field structure
// - extractListingFilters(): Captures filter configurations
// - extractFormElements(): Scrapes form field definitions
}
Installation & Setup
Prerequisites
Ensure you have the following installed:
- Node.js (Latest LTS version recommended)
- Chrome/Chromium browser for Puppeteer
- Network Access to the target legacy frontend
Installation Steps
Clone Repository
git clone https://github.com/PHPCreation/phpreaction-legacy-frontend-scraper-for-generator.git
cd phpreaction-legacy-frontend-scraper-for-generator
Install Dependencies
npm install
Environment Configuration
Create or configure your .env file:
# Target URL for scraping
SCRAPE_URL=https://your-legacy-frontend.example.com
# Authentication credentials
AUTH_USERNAME=your-username
AUTH_PASSWORD=your-password
# Chrome executable path (adjust based on your system)
CHROME_EXECUTABLE_PATH=/path/to/chrome
# Language preference (en/fr)
SCRAPER_LANGUAGE=en
Verify Setup
npm run help
Usage & Operation
Basic Commands
Standard Scraping
# Run complete scraping process
npm start
# Equivalent to
node index.js
CLI Options
The scraper supports various command-line options:
# Language selection
node index.js --language fr # French
node index.js --language en # English (default)
# Processing modes
node index.js --parameters # Parameters only
node index.js --bundle Name --url URL # Specific bundle
node index.js --url URL # Single URL
# Help
node index.js --help
Configuration Options
// Scraper configuration options in scraper.js
{
  outputDir: "./scrapedData/[language]", // Output directory
  maxRetries: 3,                         // Retry attempts
  retryDelay: 1000,                      // Delay between retries (ms)
  timeout: 500000,                       // Overall timeout (ms)
  headless: false                        // Browser visibility
}
Data Extraction Architecture & Methodology
1. Authentication & Session Management
The Login class provides robust authentication for legacy systems:
// login.js - Enterprise-grade authentication
class Login {
async login(username, password) {
// Handles legacy CSRF tokens
// Manages session persistence
// Implements retry logic for network failures
// Supports multi-language login interfaces
}
}
2. Hierarchical Panel Processing
The scraper navigates the legacy accordion menu structure as follows:
// Core extraction workflow
processAllPanels() {
// 1. Detect accordion menu structure (#accordion-menu)
// 2. Extract panel information (name, icon, state)
// 3. Navigate through collapsed/expanded states
// 4. Process menu items within each panel
// 5. Maintain hierarchical data structure
// Output: ./scrapedData/[language]/[PanelName]Bundle/[MenuName]/
}
3. Multi-Dimensional Data Extraction
Legacy UI Pattern Recognition
The scraper identifies and preserves critical UI patterns:
// Listing page extraction (table.styled-list)
const listingData = [
"Id", // Column headers in display order
"Account",
"Ref",
"Type", // Preserves legacy field ordering
"Parent",
];
// Filter extraction (#panel-filters)
const listingFilters = [
{
id: "tag",
label: "Tags", // Original legacy labels
filterType: "multiSelect",
inputType: "text", // UI component type
isMultiSelect: true,
},
];
// Detail view extraction (.show #show-fields)
const showData = [
{
title: "Basic Information", // Section groupings
fields: ["fieldName1", "fieldName2"], // Field organization
},
];
Advanced Component Recognition
Enterprise-Level Pattern Extraction
- Sidebox Recognition: Captures related entity panels and relationships
- Action Button Mapping: Identifies view/edit/delete action patterns
- Form Field Extraction: Preserves input types, validations, and structure
- Multi-language Content: Maintains translations and locale-specific configurations
- Custom UI Elements: Recognizes specialized components and their configurations
4. Structured Data Output & Generator Integration
The scraper produces structured data files that directly integrate with the CRUD generator’s legacy processing system:
// Generated index.js files in scrapedData structure
export const listingData = [
"Id", // Field order preserved from legacy UI
"Account",
"Ref",
"Type",
];
export const listingFilters = [
{
id: "tag",
label: "Tags", // Original UI labels
filterType: "multiSelect", // Component type mapping
inputType: "text",
isMultiSelect: true,
},
];
export const showData = [
{
title: "Basic Information", // Section organization
fields: ["fieldName1"], // Field structure
},
];
export const formData = {
// Form field configurations
groupName: [
{
label: "Field Label",
type: "input_type", // Legacy input patterns
},
],
};
export const sideboxData = [
{
// Related entity panels
title: "Related Items",
icon: "glyphicons-list",
panelId: "#related-panel",
},
];
Generator Integration Process:
// In CRUD generator's legacy-integration.js
const legacyData = findLegacyData(entityName);
const syncedFields = processFieldsWithLegacy(
fields,
legacyData.listingData,
"listing"
);
// Result: Metadata fields reordered to match legacy UI patterns
Complete Integration Workflow & Data Flow
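End to end, the three stages described below reduce to a pure data transformation. The helper names here are assumptions for illustration, not the generator's actual API:

```javascript
// Stage 1 output: scraped column headers, in legacy display order
const listingData = ["Account", "Ref", "Type"];

// Stage 2 input: API metadata fields, in arbitrary order
const apiFields = [
  { key: "type", title: "Type" },
  { key: "account", title: "Account" },
  { key: "ref", title: "Ref" },
];

// Stage 2: reorder API fields to the scraped legacy sequence
function syncToLegacy(fields, legacyHeaders) {
  const byTitle = new Map(fields.map((f) => [f.title.toLowerCase(), f]));
  return legacyHeaders.map((h) => byTitle.get(h.toLowerCase())).filter(Boolean);
}

// Stage 3: emit the CRUD v2 column configuration
function toMainColumns(fields) {
  return fields.map((f) => ({ title: f.title, key: f.key, sortable: true }));
}

const mainColumns = toMainColumns(syncToLegacy(apiFields, listingData));
```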
1. Three-Tier Legacy Migration Pipeline
Stage 1: Legacy Data Extraction (Scraper)
Input: PHPReaction legacy frontend URL + credentials
Process: Puppeteer-based navigation and DOM extraction
Output: scrapedData/[lang]/[Bundle]/[Entity]/index.js
// Scraper extracts:
- Panel structure (#accordion-menu)
- Listing headers (table.styled-list thead)
- Filter configurations (#panel-filters)
- Detail view sections (.show #show-fields)
- Form field patterns (edit pages)
- Action button mappings (.btn-action)
Stage 2: Metadata Enhancement (Generator)
Input: API metadata + scraped legacy data
Process: Field synchronization via legacy-integration.js
Output: Enhanced entity configurations
// Generator processes:
const legacyData = findLegacyData("AccountingBundle\\Account");
const orderedFields = syncFields(apiFields, legacyData.listingData, "listing");
// Result: API fields reordered to match legacy UI sequence
Stage 3: CRUD v2 Configuration (Final Output)
Input: Enhanced configurations
Process: Template-based generation
Output: crud_config/[entity]/index.tsx
// Final generated config maintains legacy order:
const mainColumns: MainColumnsListing[] = [
{ title: "Id", key: "id", sortable: true }, // Legacy order preserved
{ title: "Account", key: "account", sortable: true },
{ title: "Ref", key: "ref", sortable: true },
];
2. Data Synchronization Mechanisms
The integration matches API fields to legacy fields through name normalization:
// Field normalization for matching
function normalizeName(name) {
  return name
    .toLowerCase()
    .replace(/s?bundle$/, "") // Remove bundle suffix
    .replace(/\s+code$/i, "code") // Normalize code fields
    .replace(/^(app|phpreaction)entity/, "") // Remove entity prefixes
    .replace(/\s+/g, ""); // Remove whitespace
}
// e.g. normalizeName("AccountingBundle") -> "accounting"
//      normalizeName("Account Code") -> "accountcode"
// Synchronization process
function syncFields(fields, legacyFields, type) {
// 1. Create normalized name mappings
// 2. Match API fields to legacy field order
// 3. Preserve legacy labels and titles
// 4. Maintain original UI structure
return orderedFields; // Fields in legacy UI order
}
3. Multi-Language Support Integration
Handles legacy systems with multiple language interfaces:
# Scraper output structure
scrapedData/
├── en/ # English interface extraction
│ └── AccountingBundle/
│ └── Accounts/
│ └── index.js # English field labels
└── fr/ # French interface extraction
└── AccountingBundle/
└── Accounts/
└── index.js # French field labels
Legacy Preservation Strategy
The integration workflow ensures that:
- Field Order: Maintained from legacy listing tables
- Label Consistency: Original UI text preserved
- Filter Logic: Legacy filter patterns mapped to v2 components
- Section Organization: Detail view groupings maintained
- Action Patterns: Button configurations transferred
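For the filter-logic point above, one way the mapping could look is sketched below. The v2 component names are assumptions for illustration, not the generator's actual output:

```javascript
// Maps a scraped legacy filter definition to a hypothetical CRUD v2
// filter-field config. Unknown filterTypes fall back to a text input.
const FILTER_COMPONENT_MAP = {
  multiSelect: "MultiSelectFilter",
  select: "SelectFilter",
  text: "TextFilter",
};

function mapLegacyFilter(legacyFilter) {
  return {
    name: legacyFilter.id,
    label: legacyFilter.label, // legacy label preserved verbatim
    component: FILTER_COMPONENT_MAP[legacyFilter.filterType] || "TextFilter",
    multiple: Boolean(legacyFilter.isMultiSelect),
  };
}
```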
Legacy Data Structure & Generator Integration
Hierarchical Output Organization
The scraper organizes extracted data to mirror legacy frontend structure:
scrapedData/
├── en/ # English UI extraction
│ ├── AccountingBundle/ # Legacy panel grouping
│ │ ├── Accounts/
│ │ │ └── index.js # Account entity data
│ │ ├── Entries/
│ │ │ └── index.js # Entries entity data
│ │ └── Transactions/
│ │ └── index.js
│ ├── BillsBundle/ # Billing module
│ │ ├── Bills/
│ │ └── Payments/
│ ├── entityList/ # Cross-bundle entity registry
│ │ └── EntityList.js # All discovered entities
│ └── schema.js # Complete menu structure
└── fr/ # French UI extraction
└── [Identical structure with French labels]
Generator Integration Points
The CRUD generator locates and processes scraped data through multiple strategies:
// Generator's legacy-integration.js
function findLegacyData(entityName) {
// 1. Parse entity name: 'AccountingBundle\\Account'
const entity = "Account"; // Extract entity name
const bundle = "AccountingBundle"; // Extract bundle name
// 2. Normalize for file system matching
const normalizedEntity = normalizeName(entity); // 'account'
const normalizedBundle = normalizeName(bundle); // 'accounting'
// 3. Search scraped data structure
const bundleFolder = "./scrapedLegacy/AccountingBundle";
const entityFolder = findEntityFolder(bundleFolder, normalizedEntity);
// 4. Load and parse legacy data
const legacyContent = fs.readFileSync(entityFolder + "/index.js");
return parseLegacyData(legacyContent); // Extract exported constants
}
Actual Scraped Data Format (Generator-Compatible)
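These files are ES-module source rather than JSON, so the generator has to recover the exported constants from text. A minimal sketch of such a parser for the simple string-array exports follows; the real parseLegacyData implementation may differ:

```javascript
// Extracts a single `export const NAME = [...]` literal from scraped
// index.js source text. Sketch only: it handles flat, JSON-compatible
// arrays (like listingData), stripping line comments and trailing commas.
function extractExportedArray(source, name) {
  const re = new RegExp(`export const ${name} = (\\[[\\s\\S]*?\\]);`);
  const match = source.match(re);
  if (!match) return null;
  const literal = match[1]
    .replace(/\/\/[^\n]*/g, "") // drop line comments
    .replace(/,\s*([\]}])/g, "$1"); // drop trailing commas
  return JSON.parse(literal);
}
```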
Each index.js contains precisely structured exports that the generator’s legacy-integration.js parses:
// Real scraped data format from legacy frontend
export const listingData = [
"Id", // Simple array of column headers
"Account", // Extracted from table.styled-list thead th
"Ref",
"Type",
"Parent", // Order preserved from legacy UI
];
export const listingFilters = [
{
id: "tag", // Filter field identifier
label: "Tags", // Original legacy label
filterType: "multiSelect", // UI component type
inputType: "text", // Input method
isMultiSelect: true, // Selection behavior
placeholder: null, // UI placeholder
dataFetchUrl: null, // Dynamic data source
},
{
id: "disabledAt",
label: "Enabled", // Legacy terminology preserved
filterType: "select",
inputType: "text",
isMultiSelect: false,
},
];
export const showData = [
{
title: "Basic Information", // Section from .show #show-fields h2
fields: [
// Fields from .field-name elements
"fieldName1",
"fieldName2",
],
},
{
title: "Advanced Settings", // Multiple sections preserved
fields: ["advancedField1", "advancedField2"],
},
];
export const formData = {
// Form structure from edit pages
group1: [
{
// Form field groupings
label: "Account Type", // Original form labels
type: "select", // Input type detection
},
],
};
export const sideboxData = [
{
// Related entity panels
title: "Related Transactions", // Panel titles
icon: "glyphicons-list", // Icon classes
panelId: "#related-panel", // DOM references
},
];
Generator Processing:
// How generator uses this data
function processFieldsWithLegacy(entityName, fields, type) {
const legacyData = findLegacyData(entityName);
switch (type) {
case "listing":
// Uses legacyData.listingData array to reorder API fields
return syncFields(fields, legacyData.listingData, "listing");
case "form":
// Uses legacyData.formData object for form field organization
return syncFields(
fields,
Object.values(legacyData.formData).flat(),
"form"
);
case "show":
// Uses legacyData.showData sections for detail view structure
const legacyFields = legacyData.showData.flatMap(
(section) => section.fields
);
return syncFields(fields, legacyFields, "show");
}
}
Error Handling & Troubleshooting
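The retry mechanism referenced throughout (maxRetries, retryDelay) can be sketched as a generic async wrapper. This is an illustration of the pattern, not the scraper's exact code:

```javascript
// Retries an async operation with a fixed delay between attempts,
// mirroring the maxRetries / retryDelay options shown earlier.
async function withRetries(operation, { maxRetries = 3, retryDelay = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, retryDelay));
      }
    }
  }
  throw lastError; // all attempts exhausted
}
```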
Common Issues & Solutions
1. Chrome Path Issues
Problem: Chrome executable not found
Solutions:
- Verify Chrome installation: which google-chrome (or where chrome.exe on Windows)
- Update CHROME_EXECUTABLE_PATH in .env
- Install Chrome/Chromium if missing
2. Authentication Failures
# Common authentication issues:
# 1. Invalid credentials
# 2. Network connectivity issues
# 3. CSRF token problems
# 4. Session timeout
# Solutions:
# - Verify credentials in .env
# - Check network connectivity
# - Clear browser cache/cookies
# - Check for CSRF error messages in PHPR listing pages
3. Language Switch Failures
- Issue: Language switching not working
- Solution: Verify supported languages in target application
- Check: Language parameter in CLI arguments
4. Scraping Timeouts
// Adjust timeout settings in scraper.js
{
timeout: 500000, // Overall timeout (increase if needed)
maxRetries: 3, // Retry attempts
retryDelay: 1000 // Delay between retries
}
Debugging Options
# Enable verbose logging
node index.js --verbose
# Run with browser visible for debugging
# Set headless: false in scraper.js options
# Test specific URL
node index.js --url "https://example.com/debug-page"
Development & Maintenance
Development Workflow
The project follows standard Node.js development practices:
{
"scripts": {
"start": "node index.js",
"help": "node index.js --help",
"start:parameters": "node index.js --parameters",
"prepare": "husky install",
"commit": "git-cz"
}
}
Production Guidelines & Legacy Migration Strategy
Enterprise Deployment Best Practices
Production Scraping Guidelines
- Legacy System Access: Coordinate with system administrators for stable access windows
- Multi-Language Extraction: Run separate scraping sessions for each locale (en/fr)
- Data Validation Pipeline: Implement automated validation of scraped data completeness
- Version Control Integration: Commit scraped data changes to track legacy system evolution
- Generator Synchronization: Ensure scraped data format matches generator expectations
- Incremental Updates: Re-scrape only changed entities rather than full system scraping
Performance Optimization
// Optimized scraper configuration
{
headless: true, // For production runs
timeout: 300000, // Reasonable timeout
maxRetries: 2, // Avoid excessive retries
concurrent: false // Avoid overwhelming target server
}
Security Considerations
- Credential Management: Use environment variables for sensitive data
- Rate Limiting: Implement delays between requests
- Error Handling: Don’t expose sensitive information in logs
- Network Security: Use HTTPS connections where possible
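The rate-limiting point above can be sketched as a sequential processor with a fixed pause between requests. This is illustrative; the scraper's actual pacing may differ:

```javascript
// Sequentially processes scrape targets with a fixed delay between
// requests, so the legacy server is never hit in parallel bursts.
async function politeMap(items, handler, delayMs = 500) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    results.push(await handler(items[i]));
    if (i < items.length - 1) {
      // Simple client-side rate limiting between consecutive requests
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return results;
}
```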
Real-World Integration Examples
Complete Legacy Migration Workflow
# Step 1: Extract legacy frontend structure
cd phpreaction-legacy-frontend-scraper-for-generator
SCRAPER_LANGUAGE=en npm start
# Output: scrapedData/en/[Bundle]/[Entity]/index.js files
# Step 2: Copy scraped data to generator
cp -r scrapedData/* ../frontend-components-crud-react-generator/scrapedLegacy/
# Step 3: Generate CRUD config with legacy integration
cd ../frontend-components-crud-react-generator
node index.js --entityName "AccountingBundle\\Account" \
--bundleCrud "phprCrud" \
--resourceName "accounts" \
--outputPath "./generated/accounts/index.tsx"
# Step 4: Deploy to CRUD v2 project
cp generated/accounts/index.tsx ../phpreaction-frontend-crud-react-v2/src/crud_config/accounts/
Generator Integration Deep Dive
How the generator processes scraped legacy data:
// In ticket-process.js - generateConfiguration()
const originalListingFields = readJsonFile(listingFieldsPath); // From API metadata
const allFields = readJsonFile(fieldsPath); // From API metadata
// Legacy integration occurs here:
let listingLegacyFields = processFieldsWithLegacy(
entityName, // 'AccountingBundle\\Account'
listingFields, // API-provided fields
"listing" // Processing type
);
// Result: API fields reordered to match legacy UI sequence
// Example: ['id', 'name', 'type'] becomes ['name', 'type', 'id'] if legacy UI shows this order
// Final config generation
AddMainColumns(filteredAllFields, outputPath); // Uses reordered fields
AddFormInputs(entityName, allFields, requiredFields, outputPath);
AddFilterFields(entityName, requiredSelectedFields, resourceName, outputPath);
CRUD v2 Generated Configuration
Final output maintains legacy patterns:
// Generated crud_config/accounts/index.tsx
const mainColumns: MainColumnsListing[] = [
{
title: "Account", // Legacy label preserved
key: "account", // API field mapped
sortable: true,
},
{
title: "Ref", // Legacy order maintained
key: "ref",
sortable: true,
},
{
title: "Type", // Follows legacy sequence
key: "type",
sortable: true,
},
];
// Order matches scraped listingData: ["Account", "Ref", "Type"]
Support & Resources
Documentation Resources
- Project Repository: GitHub Repository
- Issue Tracking: GitHub Issues
- Related Projects:
This documentation covers the legacy frontend scraper's role in the PHPReaction migration ecosystem: its data extraction mechanisms, its integration with the CRUD generator's legacy processing pipeline, and the complete workflow from legacy UI preservation to modern CRUD v2 configuration generation.