Engineering

    How We Built the Career Matching Architecture at PathScorer

    A technical retrospective on transformer embeddings, occupation graphs, and the infrastructure decisions that actually worked.

    When we started building PathScorer, the hardest conversation we kept having internally wasn’t about machine learning. It was about representation. Specifically: what does a person’s career look like as a data structure, and what does the labor market look like as a data structure, and how do you put them in the same space so that a similarity calculation means something real?

    Everything downstream of that question (the matching logic, the salary layer, the gap analysis, the geographic optimization) only works if the answer is correct. We spent a disproportionate amount of the early architecture phase on it, and looking back, that was the right call.

    Here’s what we built, why we built it that way, and what we’d do differently.

    The representation problem

    A resume is natural language. The O*NET occupation database is a structured numerical taxonomy. A person asking “what should I do next with my career” is asking a question that spans both.

    The naive approach is to extract keywords from the resume, match them against keywords in occupation descriptions, and rank by overlap. This is essentially what job boards do, and it’s why job boards are bad at career discovery. Keywords match vocabulary, not meaning. “Revenue optimization” and “sales performance improvement” describe the same activity and share zero tokens.

    What you need is a space where semantic similarity can be computed, where two phrases that mean approximately the same thing sit close together regardless of the specific words used. This is what dense vector embeddings give you, and the transformer architecture is what makes them good enough to be useful in production.

    We fine-tuned a BERT-based model on a domain-specific corpus assembled from O*NET documentation, occupational analyst reports, and resume text. The domain adaptation matters more than most people expect. Out-of-the-box BERT has seen a lot of text, but it hasn’t seen much text about the difference between a logistics coordinator and a procurement specialist, or what distinguishes a clinical informatics role from a standard nursing position. Fine-tuning on labor market language shifts the embedding space in ways that make the downstream similarity calculations meaningfully more accurate.

    The result is that when we embed “managed vendor relationships and negotiated contracts for inbound freight” from a resume and compare it to the O*NET description of a Procurement Specialist, the cosine similarity is high. The words barely overlap. The meanings are close. That’s the foundation everything else builds on.
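    To make that concrete, here is a toy sketch of the similarity computation. The vectors below are illustrative stand-ins, not output from our fine-tuned encoder (which produces 768-dimensional embeddings); the point is only that near-synonymous phrases end up with a high cosine score.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim stand-ins for real 768-dim encoder outputs: the two phrases
# share meaning, so their (hypothetical) embeddings point the same way
# even though the surface tokens barely overlap.
resume_phrase = np.array([0.8, 0.1, 0.5, 0.2])    # "managed vendor relationships..."
occupation_desc = np.array([0.7, 0.2, 0.6, 0.1])  # Procurement Specialist description

sim = cosine_similarity(resume_phrase, occupation_desc)
```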

    Resume parsing as an NLP pipeline

    The entry point for every user is a document, either a PDF resume or plain-text input. The parsing pipeline runs three specialized stages in sequence, each with its own model.

    Named Entity Recognition handles the structural extraction: company names, job titles, dates, institutions, certifications, and explicit skill mentions. We use a RoBERTa-based NER model fine-tuned on labeled resume data. Standard off-the-shelf NER performs adequately on well-formatted resumes and degrades sharply on anything unusual: heavily formatted PDFs, non-standard layouts, resumes that mix languages. The fine-tuned version handles edge cases considerably better because it’s seen the actual distribution of formatting decisions real people make.

    The NER output feeds a temporal parser that reconstructs the career timeline: role sequence, tenure at each position, gap detection, and seniority trajectory inference. We treat the career arc itself as a feature. Someone who moved from analyst to senior analyst to manager over seven years at two companies is carrying different signal than someone who made lateral moves at the same seniority level for the same period. The trajectory shape matters for predicting fit with roles that require either deep technical specialization or demonstrated people management.

    The most technically interesting stage is implicit skill extraction. Explicit skills are easy. “Proficient in Salesforce, Excel, and SQL” is unambiguous NER. Implicit skills require inference from natural language descriptions of work.

    Consider a resume bullet like “Led cross-functional rollout of new inventory system across three distribution centers, coordinating with IT, operations, and finance stakeholders.” The words “project management,” “change management,” and “stakeholder communication” don’t appear anywhere in that sentence. But a multi-label classification model trained to recognize the semantic patterns associated with those skills will assign high probability to all three.

    We implement this as a sequence classification head on top of the fine-tuned encoder. Each responsibility description gets encoded into a fixed-length representation, which then gets classified against our full skill taxonomy using a sigmoid output layer rather than softmax, because a single responsibility description can and usually does map to multiple skills simultaneously.
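    A minimal sketch of that head, with a toy 8-dimensional encoding and random stand-in weights in place of the trained classifier and the real 768-dimensional encoder output:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical slice of the skill taxonomy; W and b are random stand-ins
# for a trained classification head.
SKILLS = ["project_management", "change_management", "stakeholder_communication"]
rng = np.random.default_rng(0)
W = rng.normal(size=(len(SKILLS), 8))  # toy 8-dim encoding instead of 768
b = np.zeros(len(SKILLS))

encoding = rng.normal(size=8)          # stand-in for the encoded bullet text
probs = sigmoid(W @ encoding + b)      # independent per-skill probabilities

# Sigmoid rather than softmax: one responsibility description can assert
# several skills at once, so the probabilities don't compete for mass.
predicted = {s: float(p) for s, p in zip(SKILLS, probs) if p > 0.5}
```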

    Pipeline output (structured representation)

    career_timeline:
      - title_raw: "Operations Manager"
        title_normalized: "General and Operations Managers"
        onet_code: "11-1021.00"
        tenure_months: 38
        seniority_level: mid
        explicit_skills:
          - inventory_management
          - erp_systems
          - team_leadership
        implicit_skills:
          - skill: vendor_negotiation
            confidence: 0.87
          - skill: budget_management
            confidence: 0.79
          - skill: process_optimization
            confidence: 0.91
          - skill: cross_func_coordination
            confidence: 0.84

    Confidence scores propagate forward through the entire pipeline. A match that’s driven primarily by high-confidence explicit skill signals is surfaced differently than one driven by moderate-confidence inferences. The uncertainty doesn’t get averaged away; it stays attached to the result.

    Want to see this architecture in action on your profile? Upload your resume and PathScorer runs the full pipeline — 1,000+ occupations, real salary data. Two minutes, free.

    Score my career — free

    Taxonomy alignment: from natural language to O*NET dimensions

    The extracted skills are in natural language. O*NET describes occupations across 35 standardized skill dimensions. Bridging that gap is the taxonomy alignment layer.

    A lookup table works for common, unambiguous terms and fails for everything outside that set. We don’t use a lookup table as the primary mechanism.

    Instead, each O*NET skill dimension gets a dense vector representation built by embedding its official label, its O*NET definition, and a set of behavioral anchors drawn from O*NET’s level descriptors. “What does it look like when someone demonstrates high Persuasion skill?” The anchor descriptions are embedded and averaged to produce a richer representation of each dimension than the label alone provides.

    This handles three problems that a lookup table can’t. First, synonymy: “revenue growth hacking” maps to Persuasion and Sales because its embedding sits near those dimension vectors regardless of the vocabulary mismatch. Second, domain specificity: “presenting to the C-suite” maps to Speaking and Coordination with high confidence because the embedded context is clear. Third, ambiguity resolution: “Python” alone is genuinely ambiguous, but “Python” in a context rich with data modeling and statistical analysis vocabulary produces a contextualized embedding that maps cleanly to Systems Analysis and Mathematics, distinct from “Python” in an infrastructure context, which maps toward Operations Analysis.
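    The alignment mechanism reduces to: embed each dimension’s label, definition, and anchors, average them, and map each extracted skill to its nearest dimension. A toy sketch (the hash-seeded embedder and the anchor texts are illustrative stand-ins, not the domain-tuned encoder or official O*NET wording):

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    """Deterministic toy embedder standing in for the fine-tuned encoder."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

# Each dimension vector is the mean of embeddings of its label, its
# definition, and behavioral anchor descriptions (paraphrased here).
DIMENSION_TEXTS = {
    "Persuasion": [
        "Persuasion",
        "Persuading others to change their minds or behavior",
        "Convinces a skeptical executive team to fund a new initiative",
    ],
    "Mathematics": [
        "Mathematics",
        "Using mathematics to solve problems",
        "Builds statistical models to quantify customer churn",
    ],
}
DIMENSION_VECS = {
    name: np.mean([embed(t) for t in texts], axis=0)
    for name, texts in DIMENSION_TEXTS.items()
}

def align(skill_phrase: str) -> str:
    """Map a natural-language skill to its nearest dimension by cosine."""
    v = embed(skill_phrase)
    return max(
        DIMENSION_VECS,
        key=lambda d: float(v @ DIMENSION_VECS[d]) / np.linalg.norm(DIMENSION_VECS[d]),
    )
```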

    Building the user skill vector

    After parsing and taxonomy alignment, the user’s profile collapses into a vector in O*NET skill space, plus additional dimensions for features that don’t map directly into the 35-dimension taxonomy: total years of experience, seniority trajectory slope, career diversity across industries, and hidden skill inputs from the intake questionnaire.

    Vector construction: weighted aggregation

    v_user[i] = Σ(weight_j × skill_score_j[i]) for all evidence j

    Weights are functions of:

    recency — exponential decay over time since role

    tenure — longer exposure increases weight

    confidence — explicit > inferred > hidden-skill assertion

    seniority — senior role mentions weighted higher per dimension

    The recency decay uses an exponential rather than a step function because skills don’t disappear when you stop using them; they attenuate. A skill demonstrated three years ago contributes to the vector at reduced weight. A skill demonstrated last year is close to full weight. A skill mentioned in a role from ten years ago is present but faint.
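    The aggregation can be sketched as follows. The half-life value is illustrative, not our production constant, and we normalize by the weight sum here for scale stability (the formula above shows the unnormalized sum):

```python
import numpy as np

def recency_weight(years_ago: float, half_life: float = 3.0) -> float:
    """Exponential attenuation: skills fade when unused but never hit zero."""
    return float(np.exp(-np.log(2) * years_ago / half_life))

# Each piece of evidence: (years since the role, confidence, skill scores).
# Confidence tiers follow the text: explicit > inferred > hidden assertion.
evidence = [
    (1.0, 1.0, np.full(35, 0.6)),   # recent role, explicit skills
    (10.0, 0.8, np.full(35, 0.9)),  # decade-old role: present but faint
]

weights = np.array([recency_weight(y) * c for y, c, _ in evidence])
scores = np.stack([s for _, _, s in evidence])
v_user = (weights[:, None] * scores).sum(axis=0) / weights.sum()
```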

    Hidden skills from the intake questionnaire (side projects, languages, volunteer work, hobbies) receive a flat confidence weight: the user is directly asserting them, so we don’t apply inference uncertainty. They run through the same taxonomy alignment pipeline as resume content and contribute to the same dimensions.

    The occupation embedding space

    O*NET provides 35-dimensional skill profiles for 1,000+ occupations directly. These form the base occupation vectors. We enrich them through a separate process before they’re used for matching.

    Each occupation has a substantial corpus of associated text: its O*NET description, its work activities list, its task statements, its knowledge requirements documentation, and the text of its representative job postings drawn from historical data. We run this full corpus through the domain-fine-tuned encoder and produce a 768-dimensional semantic embedding for each occupation, distinct from the 35-dimensional O*NET vector.

    Occupation dual representation

    v_onet[35] — O*NET skill dimensions (interpretable, gap analysis)

    v_semantic[768] — transformer embedding of occupation corpus (dense, cross-sector discovery)

    v_final = α·v_onet + (1-α)·domain_adjusted   (blend for primary scoring)

    v_search = v_semantic   (for surfacing non-obvious matches)

    The semantic embeddings are what make cross-sector discovery work. Two occupations that share significant underlying skill requirements but live in completely different parts of the O*NET taxonomy will sit close together in the 768-dimensional semantic space even when their 35-dimensional O*NET profiles show only moderate overlap.

    The occupation graph and graph neural network layer

    Occupation vectors define points. The occupation graph defines relationships between those points: weighted edges connecting occupations that share meaningful skill overlap.

    We construct the graph by computing pairwise cosine similarity across the 35-dimensional O*NET vectors for all occupation pairs, applying a threshold to create edges, and weighting edges by similarity score. The resulting graph has roughly 1,000 nodes and several tens of thousands of edges, with a weight distribution heavily skewed toward weak connections and a smaller set of strong connections between close skill neighbors.
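    The construction step is simple enough to show directly. A minimal sketch over toy vectors (the 0.7 threshold is illustrative, not our production cutoff):

```python
import numpy as np

def build_occupation_graph(vectors: np.ndarray, threshold: float = 0.7):
    """Weighted edges between occupation pairs whose 35-dim cosine
    similarity clears the threshold."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T                    # full pairwise cosine matrix
    edges = []
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):          # upper triangle: undirected graph
            if sim[i, j] >= threshold:
                edges.append((i, j, float(sim[i, j])))
    return edges

rng = np.random.default_rng(1)
occ = rng.random((20, 35))                 # toy stand-ins for O*NET profiles
graph = build_occupation_graph(occ)
```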

    What makes the graph architecture worth the additional complexity is what you can do with Graph Neural Networks. A GNN layer propagates information from each node’s neighborhood into its representation. After one hop, each occupation’s representation incorporates signals from its immediate skill neighbors. After two hops, it incorporates signals from its neighbors’ neighbors.

    We use a GraphSAGE-based architecture specifically because of its inductive learning capability: it can produce meaningful representations for new occupation nodes without full graph retraining, which matters for handling emerging roles that don’t yet have complete O*NET profiles. The aggregation function is mean pooling over sampled neighborhoods, which we found performed comparably to more complex aggregators (LSTM, max pooling) on our occupation graph while being substantially faster to train and serve.
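    The core of one hop is easy to sketch. This is the untrained skeleton only: a real GraphSAGE layer applies a learned weight matrix and nonlinearity after the concatenation, which we omit here.

```python
import numpy as np

def sage_mean_hop(features: np.ndarray, neighbors: dict,
                  sample_size: int = 3, seed: int = 2) -> np.ndarray:
    """One GraphSAGE-style hop: concatenate each node's own features with
    the mean of a sampled neighborhood (no learned transform)."""
    rng = np.random.default_rng(seed)
    out = []
    for node in range(len(features)):
        nbrs = neighbors.get(node, [])
        if nbrs:
            sampled = rng.choice(nbrs, size=min(sample_size, len(nbrs)),
                                 replace=False)
            agg = features[sampled].mean(axis=0)  # mean pooling
        else:
            agg = np.zeros_like(features[node])
        out.append(np.concatenate([features[node], agg]))  # self ++ neighborhood
    return np.stack(out)

features = np.eye(4)                                # toy 4-node features
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}  # toy skill-overlap edges
h1 = sage_mean_hop(features, neighbors)
```

    After a second application, each node would also carry signal from its neighbors’ neighbors, which is the two-hop propagation described above.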

    The GNN-derived occupation embeddings feed into the re-ranking stage rather than the initial retrieval stage. Initial retrieval uses FAISS over the dense semantic embeddings for speed. Re-ranking incorporates the GNN representations to improve ordering, particularly for cross-sector matches where the structural position of an occupation in the skill graph is informative about transition feasibility.

    Matching: retrieval and re-ranking

    The matching pipeline has two stages with different performance characteristics and objectives.

    Stage 1: Approximate Nearest Neighbor retrieval

    We use FAISS with an IVF-PQ index over the 768-dimensional semantic occupation embeddings. The index is built offline on a weekly schedule and loaded into memory at service startup. At query time, a user’s semantic representation retrieves the top-k most similar occupations in sub-millisecond time.
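    For intuition, here is an exact brute-force top-k stand-in for that search. FAISS IVF-PQ approximates the same operation with an inverted file plus product quantization, which is what makes it fast at larger scales and memory-cheap in production:

```python
import numpy as np

def top_k(query: np.ndarray, index: np.ndarray, k: int = 5):
    """Exact cosine top-k: a brute-force stand-in for the ANN index."""
    unit_q = query / np.linalg.norm(query)
    unit_i = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = unit_i @ unit_q
    order = np.argsort(-scores)[:k]        # best-first
    return order, scores[order]

rng = np.random.default_rng(3)
occupations = rng.normal(size=(1000, 768))  # toy semantic embeddings
user = rng.normal(size=768)
ids, sims = top_k(user, occupations)
```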

    Stage 2: Re-ranking

    The top-k candidates from ANN retrieval go through a re-ranker that computes a refined score incorporating features the semantic similarity calculation doesn’t capture. The re-ranker is a gradient boosted tree (XGBoost) rather than a neural model, for two reasons. First, the feature set is heterogeneous: dense embeddings, scalar features, categorical features, and user priority weights. Gradient boosted trees handle mixed feature types natively. Second, they’re fast at inference time and interpretable enough that we can debug anomalous rankings without running attribution analysis.

    Re-ranker input features

    Embedding similarity

    cosine_sim_onet (35-dim)

    cosine_sim_semantic (768-dim)

    gnn_structural_sim

    Gap features

    skill_gap_count

    skill_gap_severity

    max_gap_dimension

    gap_closability_score

    Transition features

    career_distance (graph hops)

    industry_cross (same vs different)

    seniority_match

    User priorities

    salary_weight

    stability_weight

    balance_weight

    risk_tolerance

    Occupation quality

    automation_risk_score

    ten_year_growth_rate

    salary_percentile_local
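    Assembling those groups into the flat vector the tree model consumes looks roughly like this. The field names are hypothetical and the sketch covers only a subset of the features listed above:

```python
def rerank_features(candidate: dict, priorities: dict) -> list:
    """Flatten the heterogeneous feature groups into one vector for the
    gradient boosted tree re-ranker (subset, hypothetical field names)."""
    return [
        # embedding similarity
        candidate["cosine_sim_onet"],
        candidate["cosine_sim_semantic"],
        candidate["gnn_structural_sim"],
        # gap features
        candidate["skill_gap_count"],
        candidate["skill_gap_severity"],
        candidate["gap_closability_score"],
        # transition features
        candidate["career_distance"],
        float(candidate["industry_cross"]),  # categorical -> numeric
        # user priorities
        priorities["salary_weight"],
        priorities["stability_weight"],
        priorities["risk_tolerance"],
        # occupation quality
        candidate["automation_risk_score"],
        candidate["ten_year_growth_rate"],
    ]

candidate = {
    "cosine_sim_onet": 0.72, "cosine_sim_semantic": 0.81,
    "gnn_structural_sim": 0.64, "skill_gap_count": 3,
    "skill_gap_severity": 0.4, "gap_closability_score": 0.7,
    "career_distance": 2, "industry_cross": True,
    "automation_risk_score": 0.2, "ten_year_growth_rate": 0.08,
}
priorities = {"salary_weight": 0.7, "stability_weight": 0.5, "risk_tolerance": 0.3}
features = rerank_features(candidate, priorities)
```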

    The salary layer and geographic optimization

    After re-ranking produces a scored list of occupation matches, the salary layer runs as a separate enrichment pass.

    BLS Occupational Employment and Wage Statistics data is stored in a normalized relational structure: occupation code, MSA code, employment count, and wage percentiles (10th, 25th, median, 75th, 90th). The table covers roughly 1,000 occupations across 600+ metropolitan areas.

    Geographic salary optimization

    For occupation O, user location L, relocation radius R:

    salary_local = BLS(O, L, median)

    candidate_metros = MSAs within distance R from L

    salary_max_reachable = max(BLS(O, m, median) for m in candidates)

    geo_uplift = salary_max_reachable - salary_local

    if geo_uplift > threshold → surface relocation recommendation
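    The pseudocode above translates directly. A runnable sketch with a toy stand-in for the normalized BLS table and metro distances (the $10K threshold is illustrative):

```python
# (occupation code, MSA code) -> median wage; toy values, not BLS data.
BLS_MEDIAN = {
    ("11-1021.00", "MSA_A"): 98_000,
    ("11-1021.00", "MSA_B"): 126_000,
    ("11-1021.00", "MSA_C"): 88_000,
}
DISTANCE = {("MSA_A", "MSA_B"): 60, ("MSA_A", "MSA_C"): 300}  # miles

def reachable_metros(home: str, radius: float) -> set:
    """Home metro plus every metro within the relocation radius."""
    metros = {home}
    for (a, b), d in DISTANCE.items():
        if d <= radius:
            if a == home:
                metros.add(b)
            if b == home:
                metros.add(a)
    return metros

def geo_recommendation(occ: str, home: str, radius: float,
                       threshold: float = 10_000):
    """Return the geographic uplift if it clears the threshold, else None."""
    local = BLS_MEDIAN[(occ, home)]
    best = max(BLS_MEDIAN[(occ, m)] for m in reachable_metros(home, radius))
    uplift = best - local
    return uplift if uplift > threshold else None

uplift = geo_recommendation("11-1021.00", "MSA_A", radius=100)
```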

    The salary layer also computes the total addressable salary improvement for each match: the combination of occupational uplift and geographic uplift. This composite figure is what drives the “$20K to $60K more” framing in the output.

    Serving infrastructure and latency management

    The two-minute end-to-end latency target requires careful pipeline staging. The computationally expensive step is resume parsing, specifically the implicit skill extraction pass, which runs transformer inference over potentially dozens of responsibility descriptions.

    Track A: Async parsing queue

    Resume → NER

    → Temporal parser

    → Implicit skill extractor

    → Taxonomy alignment

    → User vector construction

    Runs while user completes intake

    Track B: Synchronous intake flow

    Confirm extracted skills

    Add hidden skills

    Set priority weights

    ~90 seconds in foreground

    Both tracks complete → match request fires → ANN retrieval + re-ranking + salary enrichment → results. The actual matching computation runs at the end and takes under 500 milliseconds.
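    The two-track shape maps naturally onto concurrent tasks. A hedged orchestration sketch with stub coroutines standing in for the real pipeline stages (durations are placeholders, not measurements):

```python
import asyncio

async def parse_resume(doc: str) -> dict:
    """Track A stand-in: NER -> temporal -> implicit skills -> alignment."""
    await asyncio.sleep(0.01)
    return {"user_vector": [0.5] * 35}

async def intake_flow() -> dict:
    """Track B stand-in: ~90s of foreground questions in production."""
    await asyncio.sleep(0.02)
    return {"priorities": {"salary_weight": 0.7}}

async def score(doc: str) -> dict:
    # Parsing runs in the background while intake holds the foreground;
    # the match request fires only once both tracks have completed.
    parsed, intake = await asyncio.gather(parse_resume(doc), intake_flow())
    return {**parsed, **intake}  # input to retrieval + re-ranking

result = asyncio.run(score("resume.pdf"))
```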

    The FAISS index and GNN embeddings live in memory, loaded at service startup from a model artifact store. They’re rebuilt on a weekly schedule. The salary data updates more frequently, on a quarterly schedule aligned with BLS publication cycles.

    What we got wrong the first time

    The first version of the user vector construction used a simple average over all extracted skill signals without recency weighting. The resulting vectors overfit to the earliest roles in a person’s career history, because earlier roles often involve more varied and explicit skill development than later, more specialized roles. The recency-weighted aggregation fixed this but introduced a new edge case: people returning to work after a career break. We added a “maintenance signal” to the hidden skill intake specifically for this.

    The second significant mistake was using only O*NET skill dimensions for the initial ANN retrieval. The 35-dimensional space is too coarse for cross-sector discovery. Adding the 768-dimensional semantic embedding space as the primary retrieval index improved cross-sector match quality substantially.

    The third mistake was not separating retrieval from re-ranking in the initial architecture. The first version ran a single scoring pass over all occupation vectors. It was accurate but slow. The retrieval-then-rerank architecture took longer to build but is significantly more maintainable and extensible.

    Where the architecture goes next

    Temporal modeling of skill value is the most interesting open problem. Skill requirements shift as industries evolve, and a skill that was rare and valuable three years ago may be commoditized today. A temporally-aware skill value model would make the gap analysis more forward-looking.

    Counterfactual career path modeling, specifically “given where you are now, what would the ten-year earnings trajectory look like across different transition choices,” would add a time dimension to the current point-in-time salary comparison.

    Better uncertainty quantification throughout the pipeline would improve the output for users making high-stakes decisions. A more rigorous treatment of uncertainty, particularly for cross-sector matches where the training data is thinner, would help users understand when to weight the system’s recommendations heavily and when to treat them as one input among several.

    The map we’ve built is the most detailed representation of the U.S. labor market that we’re aware of in a consumer-facing career product. The interesting work now is in making it more accurate, more temporally aware, and more honest about what it doesn’t know.

    Run your analysis — free

    PathScorer runs on the architecture described here: transformer-based skill extraction, O*NET-grounded occupation vectors, graph neural networks, and BLS salary data across 600+ metros.

    Score my career — free
    Tags: career matching architecture, transformer resume parsing, O*NET occupation graph, FAISS career recommendation, GNN skill matching, NLP labor market