Legacy Frontend Scraper for CRUD Generator
A sophisticated Node.js-based web scraping tool that bridges the gap between PHPReaction legacy frontend applications and modern CRUD v2 implementations. This scraper extracts UI structure, field definitions, and metadata from legacy systems to enable seamless migration and maintain UI/UX consistency in generated CRUD configurations.
Legacy Migration Architecture
This tool forms a critical component in the PHPReaction ecosystem migration strategy, specifically designed to preserve years of UI/UX refinements while modernizing the underlying architecture. By automating the extraction of legacy frontend patterns, it ensures that the transition to CRUD v2 maintains user familiarity and business logic consistency.
Key Features
- Automated Web Scraping: Extracts page structures, UI elements, and features automatically
- Multi-language Support: Handles English and French content extraction
- Session Management: Login handling with retry mechanisms
- Structured Data Extraction: Panel navigation, listing pages, detail pages, and action detection
- Error Handling: Built-in retry mechanisms and graceful error recovery
- Configurable Output: Organized data structure based on language preference
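The language-keyed output layout described above can be sketched as a small path helper. This is a hypothetical illustration of the convention (the scraper derives these paths internally):

```javascript
// Hypothetical helper illustrating the language-keyed output layout;
// the real scraper computes these paths internally.
const SUPPORTED_LANGUAGES = ["en", "fr"];

function buildOutputPath(language, panelName, menuName) {
  if (!SUPPORTED_LANGUAGES.includes(language)) {
    throw new Error(`Unsupported language: ${language}`);
  }
  // Mirrors scrapedData/[language]/[PanelName]Bundle/[MenuName]/index.js
  return ["scrapedData", language, `${panelName}Bundle`, menuName, "index.js"].join("/");
}
```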
Legacy-to-Modern Integration Pipeline
The scraper operates as the first stage in a three-tier integration pipeline:
- Legacy Data Extraction -> CRUD Generator Enhancement -> CRUD v2 Configuration
- Scraper Output (scrapedData/) -> Generator Processing (legacy-integration.js) -> Final Config (crud_config/)
- UI Preservation -> Metadata Synchronization -> Generated Components
Critical Integration Points:
- Generator Integration: The legacy-integration.js module in the CRUD generator processes scraped data
- Field Synchronization: Legacy field order and labels are preserved through processFieldsWithLegacy()
- Metadata Matching: Entity metadata from API calls is enhanced with scraped UI patterns
- Configuration Generation: Final CRUD configs maintain legacy UI structure while using modern v2 architecture
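The field-synchronization step can be sketched as follows. This is a simplified, hypothetical version of what processFieldsWithLegacy() does, not the generator's actual code:

```javascript
// Simplified sketch of the field-synchronization idea: API fields are
// reordered to match the column order scraped from the legacy listing
// table, and fields the legacy UI never showed are kept at the end.
function reorderByLegacy(apiFields, legacyHeaders) {
  const normalize = (s) => s.toLowerCase().replace(/\s+/g, "");
  const rank = new Map(legacyHeaders.map((h, i) => [normalize(h), i]));
  const position = (field) =>
    rank.has(normalize(field.title))
      ? rank.get(normalize(field.title))
      : Number.MAX_SAFE_INTEGER; // unknown fields sort last, order preserved
  return [...apiFields].sort((a, b) => position(a) - position(b));
}
```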
Architecture & Technology Stack
Core Dependencies & Technical Stack
{
"puppeteer-core": "^24.1.0", // Chrome DevTools Protocol automation
"dotenv": "^16.4.7", // Environment configuration
"meow": "^13.2.0", // CLI argument parsing
"cli-meow-help": "^4.0.0" // CLI help generation
}
Project Structure
phpreaction-legacy-frontend-scraper-for-generator/
├── index.js # Entry point and main orchestration
├── login.js # Authentication module
├── scraper.js # Core scraping logic and WebScraper class
├── .env # Environment configuration
├── utils/
│ ├── cli.js # CLI argument parsing
│ └── functions.js # Utility functions
├── scrapedData/ # Output directory
│ ├── en/ # English extracted data
│ │ └── [PanelName]Bundle/
│ │ └── [MenuName]/
│ │ └── index.js # Contains listing_data and show_data
│ └── fr/ # French extracted data
│ └── [PanelName]Bundle/
│ └── [MenuName]/
│ └── index.js
├── package.json
├── README.md
└── CHANGELOG.md
WebScraper Class Architecture & Data Flow
The WebScraper class implements a sophisticated extraction pipeline:
class WebScraper {
constructor(baseUrl, options = {}) {
// Core browser automation setup
this.baseUrl = baseUrl; // Target legacy frontend URL
this.browser = null; // Puppeteer browser instance
this.page = null; // Active browser page
// Extraction configuration
this.options = {
outputDir: "./scrapedData/[language]", // Multi-language output
maxRetries: 3, // Resilience handling
retryDelay: 1000, // Retry backoff
timeout: 500000, // Operation timeout
headless: false, // Debug visibility
};
}
// Key extraction methods:
// - processAllPanels(): Navigates accordion menu structure
// - extractListingData(): Captures table headers and data patterns
// - extractShowPageData(): Extracts detail view field structure
// - extractListingFilters(): Captures filter configurations
// - extractFormElements(): Scrapes form field definitions
}
Installation & Setup
Prerequisites
Ensure you have the following installed:
- Node.js (Latest LTS version recommended)
- Chrome/Chromium browser for Puppeteer
- Network Access to the target legacy frontend
Installation Steps
Clone Repository
git clone https://github.com/PHPCreation/phpreaction-legacy-frontend-scraper-for-generator.git
cd phpreaction-legacy-frontend-scraper-for-generator
Install Dependencies
npm install
Environment Configuration
Create or configure your .env file:
# Target URL for scraping
SCRAPE_URL=https://your-legacy-frontend.example.com
# Authentication credentials
AUTH_USERNAME=your-username
AUTH_PASSWORD=your-password
# Chrome executable path (adjust based on your system)
CHROME_EXECUTABLE_PATH=/path/to/chrome
# Language preference (en/fr)
SCRAPER_LANGUAGE=en
Verify Setup
npm run help
Usage & Operation
Basic Commands
Standard Scraping
# Run complete scraping process
npm start
# Equivalent to
node index.js
CLI Options
The scraper supports various command-line options:
# Language selection
node index.js --language fr # French
node index.js --language en # English (default)
# Processing modes
node index.js --parameters # Parameters only
node index.js --bundle Name --url URL # Specific bundle
node index.js --url URL # Single URL
# Help
node index.js --help
Configuration Options
// Scraper configuration options in scraper.js
{
  outputDir: "./scrapedData/[language]", // Output directory
  maxRetries: 3,                         // Retry attempts
  retryDelay: 1000,                      // Delay between retries (ms)
  timeout: 500000,                       // Overall timeout (ms)
  headless: false                        // Browser visibility
}
Data Extraction Architecture & Methodology
1. Authentication & Session Management
The Login class provides robust authentication for legacy systems:
// login.js - Enterprise-grade authentication
class Login {
async login(username, password) {
// Handles legacy CSRF tokens
// Manages session persistence
// Implements retry logic for network failures
// Supports multi-language login interfaces
}
}
2. Hierarchical Panel Processing
The scraper navigates the legacy accordion menu structure as follows:
// Core extraction workflow
processAllPanels() {
// 1. Detect accordion menu structure (#accordion-menu)
// 2. Extract panel information (name, icon, state)
// 3. Navigate through collapsed/expanded states
// 4. Process menu items within each panel
// 5. Maintain hierarchical data structure
// Output: ./scrapedData/[language]/[PanelName]Bundle/[MenuName]/
}
3. Multi-Dimensional Data Extraction
Legacy UI Pattern Recognition
The scraper identifies and preserves critical UI patterns:
// Listing page extraction (table.styled-list)
const listingData = [
"Id", // Column headers in display order
"Account",
"Ref",
"Type", // Preserves legacy field ordering
"Parent",
];
// Filter extraction (#panel-filters)
const listingFilters = [
{
id: "tag",
label: "Tags", // Original legacy labels
filterType: "multiSelect",
inputType: "text", // UI component type
isMultiSelect: true,
},
];
// Detail view extraction (.show #show-fields)
const showData = [
{
title: "Basic Information", // Section groupings
fields: ["fieldName1", "fieldName2"], // Field organization
},
];
Advanced Component Recognition
Enterprise-Level Pattern Extraction
- Sidebox Recognition: Captures related entity panels and relationships
- Action Button Mapping: Identifies view/edit/delete action patterns
- Form Field Extraction: Preserves input types, validations, and structure
- Multi-language Content: Maintains translations and locale-specific configurations
- Custom UI Elements: Recognizes specialized components and their configurations
4. Structured Data Output & Generator Integration
The scraper produces structured data files that directly integrate with the CRUD generator’s legacy processing system:
// Generated index.js files in scrapedData structure
export const listingData = [
"Id", // Field order preserved from legacy UI
"Account",
"Ref",
"Type",
];
export const listingFilters = [
{
id: "tag",
label: "Tags", // Original UI labels
filterType: "multiSelect", // Component type mapping
inputType: "text",
isMultiSelect: true,
},
];
export const showData = [
{
title: "Basic Information", // Section organization
fields: ["fieldName1"], // Field structure
},
];
export const formData = {
// Form field configurations
groupName: [
{
label: "Field Label",
type: "input_type", // Legacy input patterns
},
],
};
export const sideboxData = [
{
// Related entity panels
title: "Related Items",
icon: "glyphicons-list",
panelId: "#related-panel",
},
];
Generator Integration Process:
// In CRUD generator's legacy-integration.js
const legacyData = findLegacyData(entityName);
const syncedFields = processFieldsWithLegacy(
fields,
legacyData.listingData,
"listing"
);
// Result: Metadata fields reordered to match legacy UI patterns
Complete Integration Workflow & Data Flow
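End to end, the three stages described below reduce to a pure data transformation. The helper names here are assumptions for illustration, not the generator's actual API:

```javascript
// Stage 1 output: scraped column headers, in legacy display order
const listingData = ["Account", "Ref", "Type"];

// Stage 2 input: API metadata fields, in arbitrary order
const apiFields = [
  { key: "type", title: "Type" },
  { key: "account", title: "Account" },
  { key: "ref", title: "Ref" },
];

// Stage 2: reorder API fields to the scraped legacy sequence
function syncToLegacy(fields, legacyHeaders) {
  const byTitle = new Map(fields.map((f) => [f.title.toLowerCase(), f]));
  return legacyHeaders.map((h) => byTitle.get(h.toLowerCase())).filter(Boolean);
}

// Stage 3: emit the CRUD v2 column configuration
function toMainColumns(fields) {
  return fields.map((f) => ({ title: f.title, key: f.key, sortable: true }));
}

const mainColumns = toMainColumns(syncToLegacy(apiFields, listingData));
```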
1. Three-Tier Legacy Migration Pipeline
Stage 1: Legacy Data Extraction (Scraper)
Input: PHPReaction legacy frontend URL + credentials
Process: Puppeteer-based navigation and DOM extraction
Output: scrapedData/[lang]/[Bundle]/[Entity]/index.js
// Scraper extracts:
- Panel structure (#accordion-menu)
- Listing headers (table.styled-list thead)
- Filter configurations (#panel-filters)
- Detail view sections (.show #show-fields)
- Form field patterns (edit pages)
- Action button mappings (.btn-action)
Stage 2: Metadata Enhancement (Generator)
Input: API metadata + scraped legacy data
Process: Field synchronization via legacy-integration.js
Output: Enhanced entity configurations
// Generator processes:
const legacyData = findLegacyData("AccountingBundle\\Account");
const orderedFields = syncFields(apiFields, legacyData.listingData, "listing");
// Result: API fields reordered to match legacy UI sequence
Stage 3: CRUD v2 Configuration (Final Output)
Input: Enhanced configurations
Process: Template-based generation
Output: crud_config/[entity]/index.tsx
// Final generated config maintains legacy order:
const mainColumns: MainColumnsListing[] = [
{ title: "Id", key: "id", sortable: true }, // Legacy order preserved
{ title: "Account", key: "account", sortable: true },
{ title: "Ref", key: "ref", sortable: true },
];
2. Data Synchronization Mechanisms
The integration matches API fields to legacy fields through name normalization:
// Field normalization for matching
function normalizeName(name) {
  return name
    .toLowerCase()
    .replace(/s?bundle$/, "") // Remove bundle suffix
    .replace(/\s+code$/i, "code") // Normalize code fields
    .replace(/^(app|phpreaction)entity/, "") // Remove entity prefixes
    .replace(/\s+/g, ""); // Remove whitespace
}
// e.g. normalizeName("AccountingBundle") -> "accounting"
//      normalizeName("Account Code") -> "accountcode"
// Synchronization process
function syncFields(fields, legacyFields, type) {
// 1. Create normalized name mappings
// 2. Match API fields to legacy field order
// 3. Preserve legacy labels and titles
// 4. Maintain original UI structure
return orderedFields; // Fields in legacy UI order
}
3. Multi-Language Support Integration
Handles legacy systems with multiple language interfaces:
# Scraper output structure
scrapedData/
├── en/ # English interface extraction
│ └── AccountingBundle/
│ └── Accounts/
│ └── index.js # English field labels
└── fr/ # French interface extraction
└── AccountingBundle/
└── Accounts/
└── index.js # French field labels
Legacy Preservation Strategy
The integration workflow ensures that:
- Field Order: Maintained from legacy listing tables
- Label Consistency: Original UI text preserved
- Filter Logic: Legacy filter patterns mapped to v2 components
- Section Organization: Detail view groupings maintained
- Action Patterns: Button configurations transferred
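For the filter-logic point above, one way the mapping could look is sketched below. The v2 component names are assumptions for illustration, not the generator's actual output:

```javascript
// Maps a scraped legacy filter definition to a hypothetical CRUD v2
// filter-field config. Unknown filterTypes fall back to a text input.
const FILTER_COMPONENT_MAP = {
  multiSelect: "MultiSelectFilter",
  select: "SelectFilter",
  text: "TextFilter",
};

function mapLegacyFilter(legacyFilter) {
  return {
    name: legacyFilter.id,
    label: legacyFilter.label, // legacy label preserved verbatim
    component: FILTER_COMPONENT_MAP[legacyFilter.filterType] || "TextFilter",
    multiple: Boolean(legacyFilter.isMultiSelect),
  };
}
```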
Legacy Data Structure & Generator Integration
Hierarchical Output Organization
The scraper organizes extracted data to mirror legacy frontend structure:
scrapedData/
├── en/ # English UI extraction
│ ├── AccountingBundle/ # Legacy panel grouping
│ │ ├── Accounts/
│ │ │ └── index.js # Account entity data
│ │ ├── Entries/
│ │ │ └── index.js # Entries entity data
│ │ └── Transactions/
│ │ └── index.js
│ ├── BillsBundle/ # Billing module
│ │ ├── Bills/
│ │ └── Payments/
│ ├── entityList/ # Cross-bundle entity registry
│ │ └── EntityList.js # All discovered entities
│ └── schema.js # Complete menu structure
└── fr/ # French UI extraction
└── [Identical structure with French labels]
Generator Integration Points
The CRUD generator locates and processes scraped data through multiple strategies:
// Generator's legacy-integration.js
function findLegacyData(entityName) {
// 1. Parse entity name: 'AccountingBundle\\Account'
const entity = "Account"; // Extract entity name
const bundle = "AccountingBundle"; // Extract bundle name
// 2. Normalize for file system matching
const normalizedEntity = normalizeName(entity); // 'account'
const normalizedBundle = normalizeName(bundle); // 'accounting'
// 3. Search scraped data structure
const bundleFolder = "./scrapedLegacy/AccountingBundle";
const entityFolder = findEntityFolder(bundleFolder, normalizedEntity);
// 4. Load and parse legacy data
const legacyContent = fs.readFileSync(entityFolder + "/index.js");
return parseLegacyData(legacyContent); // Extract exported constants
}
Actual Scraped Data Format (Generator-Compatible)
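These files are ES-module source rather than JSON, so the generator has to recover the exported constants from text. A minimal sketch of such a parser for the simple string-array exports follows; the real parseLegacyData implementation may differ:

```javascript
// Extracts a single `export const NAME = [...]` literal from scraped
// index.js source text. Sketch only: it handles flat, JSON-compatible
// arrays (like listingData), stripping line comments and trailing commas.
function extractExportedArray(source, name) {
  const re = new RegExp(`export const ${name} = (\\[[\\s\\S]*?\\]);`);
  const match = source.match(re);
  if (!match) return null;
  const literal = match[1]
    .replace(/\/\/[^\n]*/g, "") // drop line comments
    .replace(/,\s*([\]}])/g, "$1"); // drop trailing commas
  return JSON.parse(literal);
}
```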
Each index.js contains precisely structured exports that the generator’s legacy-integration.js parses:
// Real scraped data format from legacy frontend
export const listingData = [
"Id", // Simple array of column headers
"Account", // Extracted from table.styled-list thead th
"Ref",
"Type",
"Parent", // Order preserved from legacy UI
];
export const listingFilters = [
{
id: "tag", // Filter field identifier
label: "Tags", // Original legacy label
filterType: "multiSelect", // UI component type
inputType: "text", // Input method
isMultiSelect: true, // Selection behavior
placeholder: null, // UI placeholder
dataFetchUrl: null, // Dynamic data source
},
{
id: "disabledAt",
label: "Enabled", // Legacy terminology preserved
filterType: "select",
inputType: "text",
isMultiSelect: false,
},
];
export const showData = [
{
title: "Basic Information", // Section from .show #show-fields h2
fields: [
// Fields from .field-name elements
"fieldName1",
"fieldName2",
],
},
{
title: "Advanced Settings", // Multiple sections preserved
fields: ["advancedField1", "advancedField2"],
},
];
export const formData = {
// Form structure from edit pages
group1: [
{
// Form field groupings
label: "Account Type", // Original form labels
type: "select", // Input type detection
},
],
};
export const sideboxData = [
{
// Related entity panels
title: "Related Transactions", // Panel titles
icon: "glyphicons-list", // Icon classes
panelId: "#related-panel", // DOM references
},
];
Generator Processing:
// How generator uses this data
function processFieldsWithLegacy(entityName, fields, type) {
const legacyData = findLegacyData(entityName);
switch (type) {
case "listing":
// Uses legacyData.listingData array to reorder API fields
return syncFields(fields, legacyData.listingData, "listing");
case "form":
// Uses legacyData.formData object for form field organization
return syncFields(
fields,
Object.values(legacyData.formData).flat(),
"form"
);
case "show":
// Uses legacyData.showData sections for detail view structure
const legacyFields = legacyData.showData.flatMap(
(section) => section.fields
);
return syncFields(fields, legacyFields, "show");
}
}
Error Handling & Troubleshooting
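The retry mechanism referenced throughout (maxRetries, retryDelay) can be sketched as a generic async wrapper. This is an illustration of the pattern, not the scraper's exact code:

```javascript
// Retries an async operation with a fixed delay between attempts,
// mirroring the maxRetries / retryDelay options shown earlier.
async function withRetries(operation, { maxRetries = 3, retryDelay = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, retryDelay));
      }
    }
  }
  throw lastError; // all attempts exhausted
}
```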
Common Issues & Solutions
1. Chrome Path Issues
Problem: Chrome executable not found
Solutions:
- Verify Chrome installation: which google-chrome (or where chrome.exe on Windows)
- Update CHROME_EXECUTABLE_PATH in .env
- Install Chrome/Chromium if missing
2. Authentication Failures
# Common authentication issues:
# 1. Invalid credentials
# 2. Network connectivity issues
# 3. CSRF token problems
# 4. Session timeout
# Solutions:
# - Verify credentials in .env
# - Check network connectivity
# - Clear browser cache/cookies
# - Check for CSRF error messages in PHPR listing pages
3. Language Switch Failures
- Issue: Language switching not working
- Solution: Verify supported languages in target application
- Check: Language parameter in CLI arguments
4. Scraping Timeouts
// Adjust timeout settings in scraper.js
{
timeout: 500000, // Overall timeout (increase if needed)
maxRetries: 3, // Retry attempts
retryDelay: 1000 // Delay between retries
}
Debugging Options
# Enable verbose logging
node index.js --verbose
# Run with browser visible for debugging
# Set headless: false in scraper.js options
# Test specific URL
node index.js --url "https://example.com/debug-page"
Development & Maintenance
Development Workflow
The project follows standard Node.js development practices:
{
"scripts": {
"start": "node index.js",
"help": "node index.js --help",
"start:parameters": "node index.js --parameters",
"prepare": "husky install",
"commit": "git-cz"
}
}
Production Guidelines & Legacy Migration Strategy
Enterprise Deployment Best Practices
Production Scraping Guidelines
- Legacy System Access: Coordinate with system administrators for stable access windows
- Multi-Language Extraction: Run separate scraping sessions for each locale (en/fr)
- Data Validation Pipeline: Implement automated validation of scraped data completeness
- Version Control Integration: Commit scraped data changes to track legacy system evolution
- Generator Synchronization: Ensure scraped data format matches generator expectations
- Incremental Updates: Re-scrape only changed entities rather than full system scraping
Performance Optimization
// Optimized scraper configuration
{
headless: true, // For production runs
timeout: 300000, // Reasonable timeout
maxRetries: 2, // Avoid excessive retries
concurrent: false // Avoid overwhelming target server
}
Security Considerations
- Credential Management: Use environment variables for sensitive data
- Rate Limiting: Implement delays between requests
- Error Handling: Don’t expose sensitive information in logs
- Network Security: Use HTTPS connections where possible
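The rate-limiting point above can be sketched as a sequential processor with a fixed pause between requests. This is illustrative; the scraper's actual pacing may differ:

```javascript
// Sequentially processes scrape targets with a fixed delay between
// requests, so the legacy server is never hit in parallel bursts.
async function politeMap(items, handler, delayMs = 500) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    results.push(await handler(items[i]));
    if (i < items.length - 1) {
      // Simple client-side rate limiting between consecutive requests
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return results;
}
```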
Real-World Integration Examples
Complete Legacy Migration Workflow
# Step 1: Extract legacy frontend structure
cd phpreaction-legacy-frontend-scraper-for-generator
SCRAPER_LANGUAGE=en npm start
# Output: scrapedData/en/[Bundle]/[Entity]/index.js files
# Step 2: Copy scraped data to generator
cp -r scrapedData/* ../frontend-components-crud-react-generator/scrapedLegacy/
# Step 3: Generate CRUD config with legacy integration
cd ../frontend-components-crud-react-generator
node index.js --entityName "AccountingBundle\\Account" \
--bundleCrud "phprCrud" \
--resourceName "accounts" \
--outputPath "./generated/accounts/index.tsx"
# Step 4: Deploy to CRUD v2 project
cp generated/accounts/index.tsx ../phpreaction-frontend-crud-react-v2/src/crud_config/accounts/
Generator Integration Deep Dive
How the generator processes scraped legacy data:
// In ticket-process.js - generateConfiguration()
const originalListingFields = readJsonFile(listingFieldsPath); // From API metadata
const allFields = readJsonFile(fieldsPath); // From API metadata
// Legacy integration occurs here:
let listingLegacyFields = processFieldsWithLegacy(
entityName, // 'AccountingBundle\\Account'
listingFields, // API-provided fields
"listing" // Processing type
);
// Result: API fields reordered to match legacy UI sequence
// Example: ['id', 'name', 'type'] becomes ['name', 'type', 'id'] if legacy UI shows this order
// Final config generation
AddMainColumns(filteredAllFields, outputPath); // Uses reordered fields
AddFormInputs(entityName, allFields, requiredFields, outputPath);
AddFilterFields(entityName, requiredSelectedFields, resourceName, outputPath);
CRUD v2 Generated Configuration
Final output maintains legacy patterns:
// Generated crud_config/accounts/index.tsx
const mainColumns: MainColumnsListing[] = [
{
title: "Account", // Legacy label preserved
key: "account", // API field mapped
sortable: true,
},
{
title: "Ref", // Legacy order maintained
key: "ref",
sortable: true,
},
{
title: "Type", // Follows legacy sequence
key: "type",
sortable: true,
},
];
// Order matches scraped listingData: ["Account", "Ref", "Type"]
Support & Resources
Documentation Resources
- Project Repository: GitHub Repository
- Issue Tracking: GitHub Issues
- Related Projects:
This documentation covers the legacy frontend scraper's role in the PHPReaction migration ecosystem: its data extraction mechanisms, its integration with the CRUD generator's legacy processing pipeline, and the complete workflow from legacy UI preservation to modern CRUD v2 configuration generation.