#####################################################################
# _____ ___ ______ ___ ___ ___ ____ ____ _____ #
# / ___// _ \/_ __/____ |_ | / _ ) / _ | / __// _// ___/ #
# / (_ // ___/ / / /___// __/ / _ |/ __ | _\ \ _/ / / /__ #
# \___//_/ /_/ /____/ /____//_/ |_|/___//___/ \___/ #
# ____ ___ ____ __ _____ ____ __ ___ ___ ___ ______ #
# / / / ( _ ) / __/ _/_// ___// __ \ / |/ // _ \ / _ |/_ __/ #
# /_ _// _ |/ _ \ _/_/ / /__ / /_/ // /|_/ // ___// __ | / / #
# /_/ \___/ \___//_/ \___/ \____//_/ /_//_/ /_/ |_|/_/ #
#####################################################################
What if transformer models had been invented during the 486 era?
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
The current production path is the promoted MODEL_LEXICON_GOLD_V4_S3000
checkpoint running inside the DOS GPT2.EXE program. The model is
trained/exported on the host, copied into C:\MODEL, and executed by the
FreeBASIC fixed-point transformer runtime.
Verified production surface:
- DOS FreeBASIC build of
src/main_prod.basstaged asC:\GPT2SRC\MAIN.BAS - slim production executable, with legacy/lab modules staged separately as
LABMAIN.BAS - 4096-token longest-match lexicon tokenizer with printable byte fallback
- learned token and position embeddings
- causal attention, feed-forward blocks, layer norms, and output head
- Q20.12 fixed-point weights in
GPT2FX.BIN - fixed-point attention exp table in
GPT2EXP.BIN - optional q4/log compressed token-embedding artifact in
GPT2TQ4.BIN - optional q4/log compressed output-head artifact in
GPT2HQ4.BIN - optional output-head shortlist artifact in
GPT2HSL.BIN - KV decode cache for in-window generation
- rolling fixed decode once prompt-plus-output exceeds the exported context window
- deterministic greedy decode for evidence runs, plus fixed-point temperature/top-k/top-p sampling for interactive runs
- machine-readable DOS educational trace mode with prompt-token and generation-step records
- DOS non-greedy sampling matrix for temperature/top-k/top-p release evidence
- optional DOS-emitted kernel-stage timing with
--kernel-perf - parity vectors, DOS quality logs, and DOS-emitted
PERF_*timing records
Legacy matrix, block-sparse, synthetic benchmark, and diagnostic smoke-test
modules remain in the repository as lab code. The release build now avoids
compiling them into GPT2.EXE; the QEMU staging script keeps the old combined
driver available as LABMAIN.BAS for experiments. The q4/log vocabulary-tensor
path is no longer just lab code: token embeddings and the output head are wired
through host export, DOS loading, vector parity, host quality, and QEMU --perf
as a low-memory release mode.
The default checkpoint is a 2-layer, 48-dimensional, 4-head, 192-context model
with 463,168 parameters, Q20.12 fixed-point weights, and a DOS-loadable
VOCAB.BIN. Host float and host fixed quality both pass the full 10-prompt
suite at 10/10, average 0.968. DOS evidence for the same checkpoint is 10/10,
average 0.969 after the gold-v4 promotion. Physical hardware timing is
still required for board-specific speed claims.
The slim production build currently compiles to a 309,760-byte GPT2.EXE.
The optional speed candidate,
assets/gpt2_basic/MODEL_HEADSHORTLIST2048_PROD_PROBE, keeps the full Q20.12
weights resident and adds GPT2HSL.BIN, a 2,048-token output-head shortlist.
It passes host fixed quality and DOS vector parity, raises runtime memory only
from 2,055,940 to 2,064,148 bytes, and measures 3.35 tok/s on the QEMU
486DX2/66 kernel gate versus 2.41 tok/s for the full-head baseline.
The optional compressed release candidate,
assets/gpt2_basic/MODEL_TOKHEADQ4_PROD_PROBE, keeps the same 4096-token
lexicon and checkpoint behavior while replacing the resident token embedding
and output head with GPT2TQ4.BIN and GPT2HQ4.BIN. It passes DOS vector
parity and host fixed quality, reduces DOS runtime memory from 2,055,940 to
974,724 bytes, and measures 2.12 tok/s on the QEMU 486DX2/66 gate versus 2.46
tok/s for the full-resident default.
The lower-memory streaming candidate,
assets/gpt2_basic/MODEL_TOKHEADQ4_STREAM_PROD_PROBE, adds GPT2HQS.ON so DOS
streams packed output-head rows from GPT2HQ4.BIN instead of keeping the q4
codes and decode table resident. It passes DOS vector parity, lowers runtime
memory to 616,324 bytes, and measures 0.81 tok/s on the QEMU 486DX2/66 gate.
This is the real parameter-streaming path. It is deliberately kept as the
maximum-compatibility fallback rather than the default speed path.
Release mode choice:
| Mode | Model Directory | Runtime Memory | QEMU 486DX2/66 | Use When |
|---|---|---|---|---|
| Full resident | assets/gpt2_basic/MODEL |
2,055,940 B | 2.46 tok/s | best quality and simplest numeric path |
| Head shortlist | assets/gpt2_basic/MODEL_HEADSHORTLIST2048_PROD_PROBE |
2,064,148 B | 3.35 tok/s | fastest measured large-vocab path |
| q4 token+head | assets/gpt2_basic/MODEL_TOKHEADQ4_PROD_PROBE |
974,724 B | 2.12 tok/s | best low-memory default |
| q4 streamed head | assets/gpt2_basic/MODEL_TOKHEADQ4_STREAM_PROD_PROBE |
616,324 B | 0.81 tok/s | maximum RAM compatibility |
This implementation demonstrates that modern AI concepts like transformers are fundamentally just algorithms - mathematical operations that can be implemented even on hardware from decades ago. It bridges two worlds typically considered separate: cutting-edge AI and vintage computing.
Think of it as digital archaeology in reverse - building tomorrow's technology with yesterday's tools.
╔══════════════════════════════════════════════════════════════════╗
║ "We were so busy asking if LLMs could run on a 486, we didn't ║
║ stop to think if they should. The answer, by the way, is yes." ║
║ ║
║ — Anonymous DOS Enthusiast ║
╚══════════════════════════════════════════════════════════════════╝
This project serves multiple purposes:
-
Demystifying Modern AI: By stripping away the layers of optimization that make modern transformers inscrutable, we expose their fundamental mathematical operations.
-
Historical "What If?": Imagine an alternate timeline where transformers were invented in the early 1990s. How would they have been implemented with the constraints of the era?
-
Educational Tool: Learn about both transformer architecture and optimization techniques for constrained environments in an accessible way.
-
Bridge Between Communities: Connects retro-computing enthusiasts with modern AI concepts, and helps AI practitioners appreciate the elegance of optimization under constraints.
-
Proof of Concept: Demonstrates that with careful engineering, significant AI models can run on extremely limited hardware.
For a detailed academic analysis of this project, please refer to our technical white paper:
GPT-2 in BASIC: Implementing Modern Transformer Models on 486-Era Hardware
This extensive documentation includes:
- Detailed historical context of 486-era computing and early 1990s AI
- Complete technical explanations of all core innovations and optimization techniques
- Platform-specific implementation considerations
- Thorough performance analysis with benchmarking methodology
- Counterfactual historical analysis of how this implementation might have altered computing history
- Educational value and insights for modern edge AI development
- Future directions and applications
- Comprehensive academic references
The paper bridges technical implementation details with historical analysis to provide both practical insights and thought-provoking exploration of an alternate AI timeline.
Public release and media:
- v0.1.0-preview release
- Passing public release gate
- Public launch plan
- Video production plan
- Public demo script
- Promotional copy kit
- DOSBox integration
- Launch handoff kit:
gpt2-basic-launch-kit.zipon the preview release
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
╔════════════════════════════════════════════════════════════════╗
║ MINIMUM SYSTEM REQUIREMENTS (THEORETICAL) ║
║ ║
║ ■ Processor: 486DX4/100MHz ║
║ ■ Memory: 32MB RAM ║
║ ■ Storage: 10MB free disk space + swap file ║
║ ■ OS: MS-DOS 6.22 or compatible ║
║ ■ Display: VGA display (text mode) ║
║ ║
║ RECOMMENDED SYSTEM ║
║ ║
║ ■ Processor: 486DX4/100MHz or faster; Pentium optional ║
║ ■ Memory: 64MB RAM ║
║ ■ Storage: 20MB free disk space ║
║ ■ OS: MS-DOS 6.22 with HIMEM.SYS and EMM386.EXE ║
║ ║
║ DEVELOPMENT SYSTEM ║
║ ║
║ ■ FreeBASIC compiler or compatible BASIC variant ║
║ ■ DOSBox or 486-era hardware for testing ║
╚════════════════════════════════════════════════════════════════╝
The QEMU release path now targets the slim production program rooted at
src/main_prod.bas, staged for DOS as GPT2SRC\MAIN.BAS
and compiled inside FreeDOS with DOS FreeBASIC. The older combined driver is
staged as GPT2SRC\LABMAIN.BAS for experiments.
Train and export a baseline GPT2-BASIC checkpoint on the host with:
python3 scripts/train_gpt2_basic.py --profile 486sx-safeByte-level checkpoints remain supported for compatibility. The promoted default uses a DOS-loadable lexicon vocabulary. To train a new lexicon checkpoint:
python3 scripts/train_gpt2_basic.py --profile 486sx-safe --tokenizer lexicon --vocab-size 4096 --include-docs --corpus-file data/domain_curriculum/gold_curriculum_v2.txt --output assets/gpt2_basic/MODEL_LEXICON_NEWLexicon and BPE exports include MODEL/VOCAB.BIN. The DOS runtime loads
VOCAB.BIN from the executable directory or from MODEL\, validates that its
vocabulary size matches GPT2CFG.TXT, and then uses the same tokenizer mode for
prompt encoding and output decoding.
The original embedded trainer corpus is intentionally tiny and is no longer the production data source. The best current default was trained from the audited hand-curated gold curriculum:
python3 scripts/build_gold_curriculum.pyThe gold-curriculum result is documented in
qemu/evidence/gold_curriculum_v2_report.md. The project-owned generated
domain curriculum in data/domain_curriculum/domain_curriculum.txt is tracked
because the preview package ships that training surface. The conservative
online corpus is still available as warmup/provenance-tracked background text:
python3 scripts/fetch_online_training_corpus.py
python3 scripts/train_gpt2_basic.py --profile 486sx-safe --include-docs --corpus-file data/online_corpus/online_training_corpus.txt --corpus-weight 1 --output assets/gpt2_basic/MODEL_ONLINE_PRETRAINThe fetcher writes data/online_corpus/SOURCE_MANIFEST.json and
qemu/evidence/online_training_data_audit.md. Its default sources are
conservative public-domain/government/open-data text. Use
--include-sharealike only when the checkpoint distribution plan can carry the
required attribution and ShareAlike/GFDL obligations.
The first online-only candidate and the later domain/lexicon sweeps are
documented in qemu/evidence/domain_training_strategy_report.md. The current
default is the large-vocabulary gold v2 checkpoint because it beats the older
byte-domain candidate on both held-out and all-suite quality.
This writes assets/gpt2_basic/MODEL/GPT2CFG.TXT, GPT2WT.BIN, GPT2FX.BIN, and GPT2EXP.BIN. GPT2FX.BIN contains the Q20.12 fixed-point weights and GPT2EXP.BIN contains the fixed-point attention exp lookup table. The DOS executable loads them from C:\MODEL and runs its own tokenizer plus decoder-only transformer forward pass.
The trainer has named hardware profiles: 386-min, 486sx-safe, 486dx2-usable, 486dx4-plus, and pentium-best. These profiles are checkpoint shapes for host training and DOS inference; they still need QEMU and real-board timing before being quoted as final performance claims.
Validate the checkpoint before staging it into QEMU:
python3 scripts/model_report.py --model-dir assets/gpt2_basic/MODEL --strictThe QEMU helpers run this validation automatically and refuse to stage malformed model files.
Compile it under QEMU's 486 CPU model with:
bash qemu/compile_main_486.shThe successful in-VM build prints COMPILE_OK and produces C:\GPT2.EXE. The compile helper also copies the exported MODEL directory into the DOS hard disk image.
Run the compiled program with:
bash qemu/run_main_486.shRun the full fixed-point quality prompt suite with:
bash qemu/run_quality_486.shRun the educational trace suite with:
bash qemu/run_trace_486.shThat boots the real DOS executable with GPT2.EXE --trace, captures prompt
tokenization, each greedy forward/sample step, the decoded text, and final
context length into qemu/evidence/trace_486.log. This is the implemented
step-through teaching surface.
Run the VGA visual trace suite with:
bash qemu/run_visual_trace_486.shThat boots GPT2.EXE --trace, compiles the optional lab VISUAL.EXE
visualizer, switches the DOS process into Mode 13h when graphics are available,
draws token/progress bars from TRACE.LOG, and writes machine-readable
VISUAL_* records to qemu/evidence/visual_trace_486.log.
Run the pack-driven assistant shell with:
bash qemu/run_assistant_486.shThat is the scripted evidence path: it compiles the optional ASSIST.EXE
utility, loads PACKS\PACKS.TXT, discovers pack-local PACK.INI metadata,
switches the active model path per pack, retrieves pack notes, and emits
structured ASSIST_* records to qemu/evidence/assistant_486.log. The first
packs are CHAT, DOSHELP, and OFFICE.
Run the real interactive QEMU demo with:
bash qemu/run_assistant_interactive_486.shThis opens a QEMU window instead of a scripted evidence run. The DOS assistant
stays open until you type /quit. It starts in /pack CHAT, which is the
normal conversation pack. Use /about for current-pack instructions, /packs
to list packs, /pack DOSHELP for DOS/486 questions, /pack OFFICE for
writing tasks, and /history, /up, and /down for the in-DOS transcript.
The active pack model loads before the first > prompt is displayed, and a
new pack model loads immediately after /pack NAME, so the first question does
not pay the model-load cost.
During generation, the DOS shell streams Thinking: progress for prompt
tokens, context prefill, and output-token sampling, then streams Answer:
pieces as the fixed-point model produces tokens. Transcript paging is
implemented inside DOS so it does not depend on terminal scrollback; use /u
and /d as short aliases for /up and /down, or /h for history.
Each pack has its own USAGE.TXT; the repo-level index is
assets/gpt2_basic/PACKS/README.md.
The shared pack metadata contract for future Windows and OS/2 shells is
documented in docs/pack-shell-parity.md and validated by
scripts/assistant_pack_contract.py.
The current CHAT pack uses a pack-local 4096-token sentence-piece lexicon
checkpoint trained on broader casual English dialogue. CHAT\TOKBASE.TXT
is the tokenizer-basis corpus, including CHAT\LEXICON.TSV grammar words and
response-style phrase endings; CHAT\TRAIN.TXT is kept closer to clean
dialogue so the model does not learn lexicon scaffolding as answer text. Its
fixed quality suite is PASS 25/25, and the manual QEMU probe includes
ordinary prompts such as i am bored, tell me a joke, and do you like music.
Train and test every listed assistant pack model with:
python3 scripts/train_assistant_pack_models.pyTo reproduce the current larger CHAT checkpoint exactly, train it as a pack-local 486DX2 model instead of initializing from the smaller default model:
python3 scripts/train_assistant_pack_models.py --pack CHAT --profile 486dx2-usable --steps 3000 --tokenizer lexicon --vocab-size 4096 --quality-backend fixed --max-new-tokens 24 --no-init-modelThat builds pack corpora from PACK.INI, HELP.TXT, and optional
GOLDEN.TXT, fine-tunes one
pack-local checkpoint under PACKS\<ID>\MODEL, runs model_report.py, writes
pack-specific quality reports, and updates MODEL=PACKS\<ID>\MODEL. The
host quality sweep uses a 96-token reply window. The raw assistant prompt gate
adds 26 original prompts across CHAT, DOSHELP, and OFFICE and rejects label
leakage, repeated chunks, token soup, truncated endings, and off-topic replies.
The DOS assistant itself keeps interactive generation bounded to 64 tokens with
early sentence stopping, while the scripted 486 stress probe exercises
retrieval, golden, and model reply paths. SPRITE= and ICONS= fields are
reserved for Clippy-style
artwork; the current renderer is a text-mode bubble/action UI so it works
without VGA.
Run the non-greedy sampling matrix with:
bash qemu/run_sampling_486.shThat boots GPT2.EXE --sampling-matrix and writes
qemu/evidence/sampling_486.log. The matrix compares greedy, top-k, and
nucleus-style settings with fixed seeds and records generated-token counts,
timing, byte fallback, alphabetic byte fallback, sentence ending, and decoded
text for each row.
Rank exported checkpoints against held-out quality, memory, and available QEMU
--perf measurements with:
python3 scripts/profile_pareto_report.py --refresh-heldout-floatThe active MODEL row uses DOS fixed-point held-out evidence when available;
non-active checkpoint rows use host float held-out probes until they are staged
and run through DOS.
Audit every exported root checkpoint plus assistant pack-local model with:
python3 scripts/audit_exported_models.py --refresh-model-reportsThat writes qemu/evidence/exported_model_quality_inventory.md, validates each
checkpoint with model_report.py --strict, links the best matching quality
report, and makes missing or failing model evidence explicit.
Refresh the strict all-suite quality reports for every DOS-ready root export with:
python3 scripts/refresh_model_quality_reports.pyThat rewrites quality_report_<model>_all.md reports using the stricter gate
that rejects malformed fragments, unclean endings, and high phrase repetition.
Then write the quality repair plan with:
python3 scripts/plan_model_quality_repairs.pyThe plan records which exports are release-ready, which are historical failures to retire, which are host-only prototypes, and which profiles deserve another training run.
Write the preview-release manifest with:
python3 scripts/build_preview_release.py --manifest-onlyThat records the bounded iterative payload in
qemu/evidence/preview_release_manifest.md: strict-quality release models,
assistant packs, rebuild scripts, host verification tests, selected QEMU
evidence, and explicit exclusions for failed repairs and old candidates. To
build the local package tree and zip under /private/tmp, run:
python3 scripts/build_preview_release.py --forceThe curated preview release body is
docs/releases/v0.1.0-preview.md. The
package includes SHA256SUMS.txt, and the builder writes a zip-level
/private/tmp/gpt2-basic-preview.zip.sha256 sidecar for GitHub release
attachments. Attach the preview zip, DOSBox zip, hardware-transfer zip, all
.sha256 sidecars, and qemu/evidence/preview_release_manifest.md to the GitHub
prerelease. The Preview Release GitHub Actions workflow also uploads those
same files as the gpt2-basic-v0.1.0-preview workflow artifact after
verification. The release is intentionally marked as a QEMU-verified prerelease
until the same quality, performance, and assistant probes are captured from a
physical 486-class DOS machine. Pentium timing is useful scaling evidence, but
it is not a blocker for the solid 486-focused release. The hardware ladder is
tracked in docs/hardware-validation.md, with a
DOS capture batch under hardware/HWVALID.BAT, strict host verification
through scripts/verify_hardware_capture.py --require-filled-notes, and
release evidence staging through scripts/stage_hardware_capture_evidence.py.
The physical-only performance table is generated by
scripts/hardware_performance_matrix.py from staged hardware_<machine>_perf.log
files only.
Use bash qemu/run_hardware_capture_486.sh first to rehearse the same
C:\GPT2\HWVALID.BAT capture path in FreeDOS before transferring it to a
physical machine.
Build the DOSBox-ready convenience bundle with:
python3 scripts/build_dosbox_bundle.py --forceThat writes /private/tmp/gpt2-basic-dosbox.zip plus a .sha256 sidecar. The
bundle contains the short GPT2 DOS tree, DOSBox config profiles under
DOSBOX/, CWSDPMI.EXE for FreeBASIC protected-mode execution, and host
launchers such as run-chat.sh, run-completion.sh, run-demo.sh, and
RUNDEMO.BAT.
See docs/dosbox.md for the supported profiles. DOSBox is a
demo/smoke-test path; QEMU remains the preview release gate and physical 486
captures remain the solid-release gate.
After rebuilding the preview zip, DOSBox zip, hardware-transfer zip, and generated launch videos, build the launch handoff kit with:
python3 scripts/build_launch_kit.py --forceThe promo video builder requires Pillow plus ffmpeg; install the Python
dependency in the environment used for launch-media generation with
python3 -m pip install -r requirements-promo.txt.
The launch kit writes /private/tmp/gpt2-basic-launch-kit.zip plus a .sha256
sidecar for local rebuilds. The public preview release attaches the same
archive as gpt2-basic-launch-kit.zip. It bundles the verified release
payloads, DOSBox payload, hardware-transfer payload, preview manifest, generated
MP4 launch clips, thumbnail, release notes, public launch plan, and reusable
promotional copy into one deterministic handoff archive.
Build the minimal DOS transfer bundle with:
python3 scripts/build_hardware_transfer.py --forceThe builder writes /private/tmp/gpt2-basic-hardware-transfer.zip plus a
.sha256 sidecar. Both release zip builders use deterministic archive metadata,
and the preview manifest uses a pinned release date plus portable artifact
names by default, so identical no-change rebuilds produce the same zip checksum
without embedding host output paths. CI also rebuilds the preview package under
a second output root and requires the same zip checksum. The preview builder
also refuses untracked files in copied release-input roots so local scratch
files cannot alter the zip. The hardware-transfer builder applies the same
tracked-input rule to its model, pack, executable, hardware, and staged-source
inputs. Pass
--generated-date YYYY-MM-DD only for a deliberate release respin.
Verify both release archives before publishing:
python3 scripts/verify_preview_artifacts.pyThat verifier checks package checksums, zip sidecars, extracted zip payloads,
live tree versus zip payload consistency, required DOS demo files, the exact
six release model directories, assistant packs, absence of deferred media/VM
payloads and transient host-cache artifacts, SHA-256 field syntax, duplicate
entries, manifest path safety, strict POSIX path normalization, canonical
checksum ordering, plus the 8.3-safe hardware transfer manifest. It also
enforces deterministic ZIP entry metadata.
The release notes include a consumer-side command that runs the same verifier
from inside an extracted gpt2-basic-preview tree against both downloaded zips
and sidecars.
Check repository tracking hygiene before cutting or tagging a release:
python3 scripts/verify_workspace_tracking.pyThat guard rejects untracked files and any ignored workspace path outside the documented local-runtime buckets, so scratch data cannot silently become part of the release decision process.
Write the current improvement backlog with:
python3 scripts/write_improvement_backlog.pyThat keeps the preview release, model-quality repair queue, runtime work, assistant pack work, Windows/OS2 shell path, and real-hardware validation tied to the same evidence inventory. For the current preview, the release scope is the DOS demo and DOS transfer package; the OS/2/Warp package stays deferred to a later release.
Rank actual trainer architecture profiles with:
python3 scripts/architecture_profile_sweep.pyThis report includes missing profiles as explicit planning rows and uses
profile-specific DOS --quality-all and --perf evidence when available. The
current promoted default is still the 486sx-safe shape, but now with the
4096-token lexicon vocabulary selected from measured quality evidence.
Run the dedicated DOS performance contract with:
bash qemu/run_perf_486.sh 486dx2-66This boots FreeDOS, runs C:\GPT2.EXE --perf, extracts C:\PERF.LOG, and writes a parseable report to qemu/evidence/hardware_perf_report.md. Pass a model directory as the second argument to stage a non-active profile, for example bash qemu/run_perf_486.sh 486dx2-66 assets/gpt2_basic/MODEL_PROFILE_386_MIN. The same --perf mode is what we will run on a real PC later; under QEMU it is emulated CPU-profile evidence.
To emit the per-kernel timing breakdown from inside the DOS executable, pass
kernel as the third argument:
bash qemu/run_perf_486.sh 486dx2-66 assets/gpt2_basic/MODEL kernelThe current kernel row shows the 4096-token output head taking about 73.7% of the measured decode time, making vocabulary-head compression and faster head scoring the main performance target for the large-vocabulary release.
For an approximate era-speed run, use an instruction-count throttled profile:
bash qemu/run_main_486_era.sh 486dx2-66The era-speed runner supports 386dx-33, 486sx-25, 486dx-33, 486dx2-66, 486dx4-100, pentium-60, pentium-133, and host. These are repeatable QEMU -icount approximations, not cycle-accurate models of specific boards.
The run script boots FreeDOS and launches the real C:\GPT2.EXE. Text completion now requires trained model files in C:\MODEL; if they are missing, the program refuses to present fake generated output.
The older src/dos_gpt2_basic.bas target remains in the repository only as a small diagnostic smoke test for the FreeDOS/FreeBASIC/QEMU toolchain.
The old compact prompt prior is disabled by default. The primary demo path is a trained GPT2-BASIC model exported from PyTorch and executed by the DOS BASIC runtime.
Current Trained-Model Default:
The current real-inference demo uses the promoted lexicon GPT2-BASIC checkpoint
with 2 layers, 48 embedding dimensions, 4 heads, a 192-token context window, and
a 4096-token lexicon vocabulary. The primary DOS path uses fixed-point weights
and integer inference kernels. The model has 463,168 parameters, fixed weights
of 1,852,672 bytes, and measured DOS runtime memory of about 2,055,940 bytes.
The current QEMU 486dx2-66 --perf run for this promoted model generated 127
tokens in 51.57 seconds, or 2.46 tokens/sec.
The q4/log low-memory release mode keeps the same fixed checkpoint but stores
both 196,608-value vocabulary tensors compactly: token embeddings in
GPT2TQ4.BIN and the output head in GPT2HQ4.BIN. Each tensor has 98,304
packed q4 bytes plus per-token scales. DOS dequantizes only the current token
embedding row and expands a small output-head decode table at load time, so the
compressed path keeps usable speed while saving about 1.08 MB of runtime
memory.
┌──────────────────────────────┬────────────────────┬───────────────────┬───────────────────┐
│ Configuration │ Tokens per Second │ 70-Token Demo │ 100-Token Equiv. │
├──────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ QEMU 386dx-33 no-FPU │ 0.31 │ 228.1 seconds │ 325.8 seconds │
│ QEMU 486sx-25 no-FPU │ 0.61 │ 114.0 seconds │ 162.9 seconds │
│ QEMU 486dx-33 │ 1.23 │ 57.0 seconds │ 81.4 seconds │
│ QEMU 486dx2-66 --perf │ 2.46 │ 28.4 seconds │ 40.6 seconds │
│ QEMU 486dx4-100 │ 4.91 │ 14.2 seconds │ 20.4 seconds │
│ QEMU pentium-60 │ 4.92 │ 14.2 seconds │ 20.3 seconds │
│ QEMU pentium-133 │ 9.85 │ 7.1 seconds │ 10.2 seconds │
│ QEMU 486dx2-66 q4 head │ 2.12 │ 33.0 seconds │ 47.1 seconds │
│ QEMU 486dx2-66 q4 tok+head │ 2.12 │ 33.0 seconds │ 47.1 seconds │
│ QEMU 486dx2-66 q4 streaming │ 0.81 │ 86.3 seconds │ 123.5 seconds │
│ Host-speed QEMU --perf │ 43.55 │ 1.6 seconds │ 2.3 seconds │
└──────────────────────────────┴────────────────────┴───────────────────┴───────────────────┘
These rows are now measured DOS GPT2.EXE --perf rows under QEMU, not just
planning estimates. The 386dx-33 row uses QEMU's 486,-fpu CPU model with
conservative instruction throttling because this QEMU build does not expose a
true 386 CPU model.
Current quality evidence for the promoted lexicon default is in
qemu/evidence/quality_report_default_model_all.md,
qemu/evidence/quality_report_default_model_fixed_all.md, and
qemu/evidence/quality_report_dos_all.md. The DOS educational trace evidence
is in qemu/evidence/trace_486.log, and the non-greedy sampling evidence is
in qemu/evidence/sampling_486.log. These are emulator and host measurements
until we can repeat the same commands on a physical PC.
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
Our implementation includes several innovative techniques that would have been considered groundbreaking optimizations in the 486 era. For complete technical details, see the core innovations section in our technical documentation:
GPT2CFG.TXT checkpoint shape and file contract
GPT2WT.BIN float32 host reference weights
GPT2FX.BIN Q20.12 signed fixed-point weights, 4 bytes/value
GPT2EXP.BIN fixed-point exp lookup table for attention softmax
GPT2TQ4.BIN optional q4/log compressed token-embedding artifact
GPT2HQ4.BIN optional q4/log compressed output-head artifact
The verified production checkpoint currently stores weights as signed Q20.12 LONG values. This is larger than packed int8/int4 formats, but it keeps the DOS runtime simple, deterministic, and parity-checkable against the host fixed-point reference.
The current measured compressed release path is token-embedding plus output-head
q4/log. It is optional because the full Q20.12 tensors are still faster, but it
is useful when RAM matters more than peak speed: the measured QEMU 486DX2/66
path saves about 53% runtime memory and gives up about 14% throughput. The
streamed output-head variant saves about 70% runtime memory and gives up about
67% throughput. Full model packed int16, int8, and broader 4-bit formats remain
architecture experiments until each one has the same host validator, DOS
loader, vector parity, quality report, and --perf timing.
Inspired by techniques from early 3D engines like Doom and Quake, we use fixed-point arithmetic throughout. This provides:
- Much faster computation than floating point on 486 hardware
- Sufficient precision for transformer computations
- Efficient implementation of mathematical operations
- Compatibility with 486SX processors lacking an FPU
For example, multiplying two fixed-point numbers looks like:
FUNCTION FixedMul(a AS LONG, b AS LONG) AS LONG
DIM result AS LONGINT
result = (CLNGINT(a) * CLNGINT(b)) \ 4096
RETURN CLNG(result)
END FUNCTION┌─────────┬─────────┬─────────┐ ┌─────────┬─────────┬─────────┐
│ X X X X │ . . . . │ . . . . │ │ │ │ │
│ X X X X │ . . . . │ . . . . │ │ BLOCK │ │ │
│ X X X X │ . . . . │ . . . . │ │ 1 │ │ │
│ X X X X │ . . . . │ . . . . │ │ │ │ │
├─────────┼─────────┼─────────┤ ├─────────┼─────────┼─────────┤
│ . . . . │ X X X X │ . . . . │ │ │ │ │
│ . . . . │ X X X X │ . . . . │ │ │ BLOCK │ │
│ . . . . │ X X X X │ . . . . │ → │ │ 2 │ │
│ . . . . │ X X X X │ . . . . │ │ │ │ │
├─────────┼─────────┼─────────┤ ├─────────┼─────────┼─────────┤
│ . . . . │ . . . . │ X X X X │ │ │ │ │
│ . . . . │ . . . . │ X X X X │ │ │ │ BLOCK │
│ . . . . │ . . . . │ X X X X │ │ │ │ 3 │
│ . . . . │ . . . . │ X X X X │ │ │ │ │
└─────────┴─────────┴─────────┘ └─────────┴─────────┴─────────┘
Dense Attention Matrix Sparse Block Representation
Attention matrices in transformers require O(n²) memory for context length n. On a 486 with just 32MB RAM, this becomes prohibitive rapidly. The repository includes block-sparse lab code for this direction:
- Divide attention matrices into fixed-sized blocks
- Use a linked-list structure to store only non-zero blocks
- Implement specialized sparse matrix multiplication
- Automatically detect when to use sparse vs. dense representation
- Achieve 50-80% memory reduction for typical patterns
This technique was inspired by sparse matrix methods used in early scientific computing and CAD software of the era. It is not the current production GPT2-BASIC decode path.
┌─────────────────┐ ┌────────────────────┐
│ Model Structure │ │ Layer 0 Parameters │
└────────┬────────┘ └──────────┬─────────┘
│ │
│ ┌───────────┐ │
└─▶│ RAM │◀──────────┘
│ (32MB max)│
└─────┬─────┘
│
▼
┌─────────────────────────┐
│ Disk Storage │
├─────────────────────────┤
│ Layer 1 Parameters │
│ Layer 2 Parameters │
│ Vocabulary │
│ ... │
└─────────────────────────┘
To handle models that exceed comfortable RAM budgets, the current production path implements streaming where the measured pressure is highest:
- Store the compressed output head on disk in
GPT2HQ4.BIN - Keep levels, scales, and one row buffer resident
- Stream packed output-head rows on demand when
GPT2HQS.ONis present - Keep the faster resident q4 mode available when memory allows it
- Measure the memory/speed tradeoff through DOS vector parity and
--perf
This approach is reminiscent of how games like Wing Commander managed to create experiences that seemed to exceed the hardware limitations of the time. Older lab code still explores broader layer-style streaming, but production claims should be tied to the measured q4/log streamed-head mode.
The release executable also has a teaching/introspection mode:
bash qemu/run_trace_486.shInside FreeDOS this runs GPT2.EXE --trace against the same C:\MODEL files
used by quality, vector, and performance evidence. The trace records model
shape, tokenizer mode, prompt pieces, each generated token, and the final
decoded text in a stable TRACE_* line format. It is intentionally text-mode
and machine-readable so it works on the same era-accurate systems as the main
demo.
Although the 486 lacks SIMD instructions, we can simulate parallel processing at the bit level:
' Pack 4 8-bit values into a single 32-bit integer
FUNCTION Pack_8bit(v1 AS BYTE, v2 AS BYTE, v3 AS BYTE, v4 AS BYTE) AS LONG
RETURN ((v1 AND &HFF)) OR _
((v2 AND &HFF) << 8) OR _
((v3 AND &HFF) << 16) OR _
((v4 AND &HFF) << 24)
END FUNCTIONThis technique lets us:
- Process multiple values in a single operation
- Reduce loop overhead
- Maximize use of the 32-bit registers
- Achieve "poor man's SIMD" years before MMX extensions
This approach draws inspiration from demoscene coding techniques, where every cycle and byte mattered.
The most performance-critical sections are implemented in optimized x86 assembly:
' Example: Assembly-optimized fixed-point multiplication
' This would use MOV, IMUL, and bit shifting instructions in actual assembly
FUNCTION AsmFixedMul(a AS INTEGER, b AS INTEGER) AS INTEGER
' In real implementation, this would be pure x86 assembly
' Simulated version for demonstration:
DIM result AS LONGINT = CLNGINT(a) * CLNGINT(b)
RETURN CINT(result >> 16)
END FUNCTIONKey optimizations include:
- Register allocation for critical loops
- CPU capability detection (FPU present?)
- Custom division and square root routines
- Loop unrolling for matrix operations
- Block-based processing for cache efficiency
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
┌──────────────────┬────────────────────┬───────────────────────┐
│ System │ Architecture │ Parameters │
├──────────────────┼────────────────────┼───────────────────────┤
│ This Project │ Transformer (GPT) │ ~1 million │
│ 1990s Expert Sys │ Rule-based │ Thousands of rules │
│ 1990s Neural Net │ Multilayer Percp. │ ~100-10,000 │
│ 1997 Deep Blue │ Search + Eval │ ~4,000 position params│
└──────────────────┴────────────────────┴───────────────────────┘
During the 486 era (early-to-mid 1990s), AI was dominated by:
- Expert Systems: Rule-based decision making
- Small Neural Networks: Typically <5 layers, <10,000 parameters
- Statistical Methods: Hidden Markov Models, Bayesian approaches
- Game-Playing Systems: Deep Blue (chess) was state-of-the-art
This implementation represents a fascinating "alternate history" - what if transformer architecture had been invented during this period? With what techniques would it have been implemented? Our alternative history impact analysis explores this counterfactual scenario in depth.
This project employs many techniques that were cutting-edge in the 486 era:
- Fixed-point arithmetic: Used in early 3D engines like Doom and Quake
- Lookup tables: Common in demoscene effects and games
- Memory streaming: Used in games like Wing Commander
- Block-based processing: Employed in early multimedia codecs
- Assembly optimization: Essential for any performance-critical software
The difference is that we're applying these vintage techniques to a modern AI architecture, creating a bridge between computing eras.
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
The production executable is intentionally narrower than the full repository:
src/main_prod.bas stages as GPT2SRC\MAIN.BAS and includes only the
tokenizer, minimal allocation accounting, the fixed-point GPT2 runtime, and the
release entrypoints. The older combined driver is still staged as LABMAIN.BAS
for experiments.
┌───────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ Tokenizer │ │ Fixed-Point │ │ Model Files │
│ lexicon/byte │◀─┤ GPT Runtime │◀─┤ GPT2FX/EXP/TQ4/HQ4│
└───────────────┘ └──────────────────┘ └───────────────────┘
▲ ▲ ▲
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ KV Cache │ │ Causal │ │ Greedy │
│ decode state │◀─┤ Attention │◀─┤ Sampling │
└───────────────┘ └──────────────────┘ └───────────────────┘
▲ ▲ ▲
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ Quality │ │ Vector │ │ PERF_* │
│ suites │◀─┤ parity │◀─┤ timing logs │
└───────────────┘ └──────────────────┘ └───────────────────┘
▲ ▲ ▲
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ QEMU/FreeDOS │ │ Host export │ │ Evidence │
│ runners │◀─┤ scripts │◀─┤ reports │
└───────────────┘ └──────────────────┘ └───────────────────┘
/src
├── main.bas # DOS entry point, quality/perf/vector modes
├── real_gpt.bas # verified trained GPT2-BASIC fixed-point runtime
├── tokenizer.bas # byte fallback plus DOS-loadable BPE/lexicon vocabularies
├── quality_prior.bas # disabled prompt-prior legacy path
├── data_structures.bas # shared data/config structures
├── simd_ops.bas # CPU detection and platform helpers
├── memory_manager.bas # memory accounting helpers
├── model.bas # legacy matrix transformer path
├── quantization.bas # lab 4-bit/log quantization code
├── block_sparse.bas # lab sparse attention code
├── benchmark.bas # synthetic benchmark code
└── dos_gpt2_basic.bas # small diagnostic smoke target
/scripts
├── train_gpt2_basic.py # host training/export entrypoint
├── gpt2_basic_tokenizer.py # shared byte/BPE/lexicon tokenizer contract
└── quantize_gpt2_basic.py # q4/log token/head release artifact builder
/assets/gpt2_basic/MODEL # host-exported production checkpoint
├── GPT2CFG.TXT # model shape
├── GPT2WT.BIN # float32 reference weights
├── GPT2FX.BIN # Q20.12 fixed-point weights
├── GPT2EXP.BIN # fixed-point exp lookup table
├── VOCAB.BIN # DOS tokenizer vocabulary and mode
└── GPT2VEC.TXT # parity vectors
/assets/gpt2_basic/MODEL_HEADQ4_PROD_PROBE
└── GPT2HQ4.BIN # optional q4/log output-head release artifact
/assets/gpt2_basic/MODEL_TOKHEADQ4_PROD_PROBE
├── GPT2TQ4.BIN # optional q4/log token-embedding artifact
└── GPT2HQ4.BIN # optional q4/log output-head artifact
Input Text
│
▼
┌─────────────┐
│ Tokenizer │
└─────┬───────┘
│
▼
┌─────────────┐
│ Embedding │
└─────┬───────┘
│
▼
┌─────────────┐
│ Transformer │◄────┐
│ Layer 1 │ │
└─────┬───────┘ │
│ │
▼ │ Repeat
┌─────────────┐ │ for N
│ Transformer │─────┘ layers
│ Layer 2 │
└─────┬───────┘
│
▼
┌─────────────┐
│ Output │
│ Layer │
└─────┬───────┘
│
▼
┌─────────────┐
│ Sampling │
└─────┬───────┘
│
▼
Generated Text
Our model follows the GPT-2 architecture with several modifications for efficiency:
- byte or DOS-loadable BPE/lexicon tokenizer, with the default using 4096 lexicon tokens
- 2-4 transformer layers depending on exported profile
- 32-96 embedding dimensions depending on exported profile
- 4-6 attention heads depending on exported profile
- learned position embeddings
- standard GELU feed-forward blocks
- fixed context windows from 128 to 256 tokens depending on profile
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
This implementation serves as an educational tool for understanding:
-
Core Transformer Concepts:
- Self-attention mechanisms
- Layer normalization
- Feed-forward networks
- Positional encoding
- Token embedding
-
Generation Process:
- Autoregressive text generation
- Temperature-based sampling
- Context management
-
Model Architecture:
- Weight matrices and their relationships
- Information flow through layers
- Parameter scaling considerations
The project also teaches valuable lessons in optimization:
-
Memory Efficiency:
- Quantization techniques
- Sparse representations
- Streaming from disk
-
Computational Efficiency:
- Fixed-point arithmetic
- SIMD-like operations
- Assembly optimization
- Cache-friendly algorithms
-
I/O and System Integration:
- File format design
- Memory management
- Resource streaming
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
Compile the project using FreeBASIC:
fbc -lang fb src/main_prod.bas -o gpt2_basic.exe
For optimized build (with inline assembly):
fbc -lang fb -O 2 src/main_prod.bas -o gpt2_basic.exe
gpt2_basic
You'll be presented with a main menu offering:
- Text Completion
- Chat Application
- Run Benchmarks
- System Information
- Load/Initialize Model
The text completion and chat interfaces allow you to interact with the model and configure generation parameters like temperature, top-p, and maximum output length.
From the main menu, select option 3 to run a suite of benchmarks testing various components:
- Matrix operations (standard vs. SIMD-like)
- Attention mechanisms (dense vs. sparse)
- Softmax implementation
- Full forward pass
Production model shape is controlled by the exported fixed-point checkpoint in C:\MODEL. Use the host trainer profiles for repeatable checkpoint builds:
386-min486sx-safe486dx2-usable486dx4-pluspentium-best
For testing in DOSBox, use the generated bundle:
python3 scripts/build_dosbox_bundle.py --force
cd /private/tmp/gpt2-basic-dosbox
dosbox -conf DOSBOX/GPT2DEMO.CONFThe generated configs use core=dynamic, cycles=max, memsize=64, and
mount c ., then start in C:\GPT2.
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
The older repository documentation includes synthetic component benchmarks for the lab matrix/runtime code. They are useful for background, but they are not the current production GPT2-BASIC performance claim. Current production timing should come from GPT2.EXE --perf and qemu/evidence/hardware_perf_report.md.
┌───────────────────┬───────────────────┬────────────────┐
│ Operation │ Standard Version │ Optimized │
├───────────────────┼───────────────────┼────────────────┤
│ Matrix Addition │ 124.5 ms │ 38.7 ms (3.2x) │
│ Matrix Transpose │ 32.8 ms │ 12.4 ms (2.6x) │
│ Matrix Multiply │ 156.2 ms │ 47.3 ms (3.3x) │
│ Attention │ 241.6 ms │ 86.2 ms (2.8x) │
│ Softmax │ 12.8 ms │ 5.1 ms (2.5x) │
│ Forward Pass │ 310.4 ms │ 92.7 ms (3.3x) │
│ Full Generation │ 32.5 ms/token │ 9.8 ms/token │
└───────────────────┴───────────────────┴────────────────┘
Current fixed-point results for the promoted 4096-token lexicon checkpoint:
┌──────────────────────────────┬────────────────────┬───────────────────┬───────────────────┐
│ Configuration │ Tokens per Second │ 70-Token Demo │ 100-Token Equiv. │
├──────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ QEMU 386dx-33 no-FPU │ 0.31 │ 228.1 seconds │ 325.8 seconds │
│ QEMU 486sx-25 no-FPU │ 0.61 │ 114.0 seconds │ 162.9 seconds │
│ QEMU 486dx-33 │ 1.23 │ 57.0 seconds │ 81.4 seconds │
│ QEMU 486dx2-66 --perf │ 2.46 │ 28.4 seconds │ 40.6 seconds │
│ QEMU 486dx4-100 │ 4.91 │ 14.2 seconds │ 20.4 seconds │
│ QEMU pentium-60 │ 4.92 │ 14.2 seconds │ 20.3 seconds │
│ QEMU pentium-133 │ 9.85 │ 7.1 seconds │ 10.2 seconds │
│ Host-speed QEMU --perf │ 43.55 │ 1.6 seconds │ 2.3 seconds │
└──────────────────────────────┴────────────────────┴───────────────────┴───────────────────┘
These are GPT2.EXE --perf measurements from FreeDOS emulation. Measure the
target PC directly before quoting board-specific speed.
The verified 486sx-safe production checkpoint currently reports 2,055,940 bytes of DOS runtime memory. The older planning table below is retained as historical design context for larger matrix/runtime configurations, not as the current measured production footprint.
┌───────────────────────────┬─────────────────┬────────────────┐
│ Configuration │ In-Memory Mode │ Streaming Mode │
├───────────────────────────┼─────────────────┼────────────────┤
│ 2-layer, 64-dim, 1K vocab │ 506 KB │ 276 KB │
│ 2-layer, 128-dim, 5K vocab│ 1.7 MB │ 582 KB │
│ 4-layer, 128-dim, 5K vocab│ 3.2 MB │ 624 KB │
└───────────────────────────┴─────────────────┴────────────────┘
┌───────────────────┬─────────────────┬────────────────┐
│ Component │ Standard │ Optimized │
├───────────────────┼─────────────────┼────────────────┤
│ Model Parameters │ 4,194,304 bytes │ 524,288 bytes │
│ Working Memory │ 2,097,152 bytes │ 524,288 bytes │
│ Attention Matrices│ 8,388,608 bytes │ 838,860 bytes │
│ Other Structures │ 1,048,576 bytes │ 262,144 bytes │
├───────────────────┼─────────────────┼────────────────┤
│ Total Peak │ 15,728,640 bytes│ 2,149,580 bytes│
└───────────────────┴─────────────────┴────────────────┘
Several limitations have been identified during implementation:
- Physical hardware evidence: QEMU
-icountmeasurements are repeatable emulator evidence, not cycle-accurate proof for a specific motherboard. - Prompt coverage: The current checkpoint passes the measured DOS held-out and runtime suites, but broader open-ended prompts still need product testing.
- Generation speed: The current QEMU
486dx2-66 --perfmeasurement is 2.46 tokens/sec for the full-resident default, 2.12 tokens/sec for the q4/log token+head release mode, and 0.81 tokens/sec for the q4/log streamed-head fallback. - Context length: Generation rolls forward after the exported context window, but attention remains limited to the active 192-token window.
- Sampling: Greedy evidence remains the deterministic release gate. Interactive fixed-point decode supports temperature, top-k, and top-p, and
qemu/evidence/sampling_486.lognow provides the DOS-side non-greedy product matrix.
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
This project is released under the MIT License. If you use this code or concepts in your work, please cite:
@misc{gpt2_basic,
author = {tsotchke},
title = {GPT-2 in BASIC: Implementing Modern Transformer Models on late 1990s 486-Era Hardware},
year = {2025},
howpublished = {\url{https://github.com/tsotchke/gpt2-basic}},
note = {Implementation of a scaled-down GPT-2-like transformer model in BASIC optimized for 486-era hardware}
}
This project is released under the MIT License. See the LICENSE file for details.
▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
This project stands at the fascinating intersection of modern AI and retrocomputing, demonstrating that the fundamental algorithms powering today's most advanced language models could theoretically have been implemented decades earlier. The current QEMU 486DX2/66 evidence is no longer only theoretical: the promoted fixed-point DOS runtime produces useful short completions at 2.46 tok/s in the full-resident mode, 2.12 tok/s in the low-memory q4/log token+head mode, and 0.81 tok/s in the streamed-head fallback.
The journey of implementing GPT-2 in BASIC reveals several profound insights:
-
Algorithmic Essence: When stripped of GPU optimizations and specialized hardware, transformers are revealed to be fundamentally just sequences of mathematical operations—multiplication, addition, and non-linear transformations—that can be implemented on virtually any computing hardware. Our detailed technical architecture documentation demonstrates this clearly.
-
Optimization Artistry: The constraints of vintage hardware force a return to the lost art of careful optimization. Techniques that were once common knowledge among programmers—fixed-point arithmetic, bit manipulation, assembly optimization—have largely faded from mainstream programming but remain powerful approaches for constrained environments.
-
Educational Bridge: This implementation serves as a bridge between eras, helping modern AI practitioners understand the fundamental operations of transformers while teaching vintage computing enthusiasts about contemporary AI concepts. See our educational value section for more insights.
This counterfactual implementation also invites us to consider how computing history might have unfolded differently if transformer models had emerged in the early 1990s rather than the late 2010s. Would we have seen earlier development of large language models? Would hardware have evolved differently to accelerate such models? These questions remain fascinating thought experiments.
As we look to the future of AI, this backward-compatible implementation reminds us that the core algorithms driving our most advanced systems are not as mysterious or inaccessible as they might seem. By understanding these fundamentals, we're better positioned to develop the next generation of AI systems, whether they run on quantum computers or on embedded devices with constraints that make a 486 seem powerful by comparison.
In the end, this project stands as both a technical achievement and a reminder that innovation often comes from revisiting fundamental principles under new constraints.