Python source

Source code for the llms_txt Python module, which contains helpers to create and use llms.txt files

Introduction

The llms.txt file spec covers files located at the path /llms.txt of a website (or, optionally, in a subpath). llms-sample.txt is a simple example. A file following the spec contains the following sections as markdown, in this order:

  • An H1 with the name of the project or site. This is the only required section.
  • A blockquote with a short summary of the project, containing key information necessary for understanding the rest of the file.
  • Zero or more markdown sections (e.g. paragraphs, lists, etc.) of any type except headings, containing more detailed information about the project and how to interpret the provided files.
  • Zero or more markdown sections delimited by H2 headers, containing “file lists” of URLs where further detail is available.
    • Each “file list” is a markdown list in which every item contains a required markdown hyperlink [name](url), optionally followed by a : and notes about the file.
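
To make the ordering concrete, here is a minimal sketch that assembles a file in the order listed above. The helper name `make_llms_txt` and its parameters are hypothetical, introduced only for illustration; they are not part of the llms_txt module.

```python
def make_llms_txt(title, summary=None, info=None, sections=None):
    """Assemble llms.txt text in the order the spec lists (hypothetical helper)."""
    parts = [f"# {title}"]                       # H1: the only required section
    if summary: parts.append(f"> {summary}")     # optional blockquote summary
    if info: parts.append(info)                  # optional free-form markdown, no headings
    for name, links in (sections or {}).items():
        lines = [f"## {name}", ""]               # each H2 opens a file list
        for link_title, url, notes in links:
            item = f"- [{link_title}]({url})"    # required markdown hyperlink
            if notes: item += f": {notes}"       # notes after the colon are optional
            lines.append(item)
        parts.append("\n".join(lines))
    return "\n\n".join(parts) + "\n"

example = make_llms_txt(
    "Demo project",
    summary="A one-line summary.",
    info="Some extra details.",
    sections={"Docs": [("Guide", "https://example.com/guide.md", "Start here"),
                       ("API", "https://example.com/api.md", None)]},
)
print(example)
```

Only the title is mandatory here, mirroring the spec: calling `make_llms_txt("X")` yields a file containing just the H1.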

Here’s the start of a sample llms.txt file we’ll use for testing:

from pathlib import Path
samp = Path('llms-sample.txt').read_text()
print(samp[:480])
# FastHTML

> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

Remember:

- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it's automatic)
- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element

Reading

We’ll implement parse_llms_file to pull out the sections of llms.txt into a simple data structure.



named_re

 named_re (nm, pat)

Pattern to match pat in a named capture group



opt_re

 opt_re (s)

Pattern to optionally match s
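
The two helpers only build pattern strings, so they can be sketched in a couple of lines. The bodies below are an assumption based on the docstrings above and may differ from the module's actual source.

```python
import re

def named_re(nm, pat):
    # Wrap `pat` in a named capture group called `nm`
    return f"(?P<{nm}>{pat})"

def opt_re(s):
    # Wrap `s` in a non-capturing group and make the whole group optional
    return f"(?:{s})?"

# Compose the two: a key, optionally followed by "=value"
pat = named_re("key", r"\w+") + opt_re("=" + named_re("val", r"\w+"))
m1 = re.fullmatch(pat, "a=b").groupdict()
m2 = re.fullmatch(pat, "a").groupdict()
print(m1, m2)
```

A named group that sits inside an unmatched optional group comes back as `None` in `groupdict()`, which is how a missing part of a match can be detected.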

We’ll work “outside in” so we can test the innermost matches as we go.

Parse sections

sections = '''First bit.

## S1

-[foo](http://foo)
- [foo2](http://foo2): stuff

## S2

- [foo3](http://foo3)'''
import re
start,*rest = re.split(r'^##\s*(.*?$)', sections, flags=re.MULTILINE)
start
'First bit.\n\n'
rest
['S1',
 '\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n',
 'S2',
 '\n\n- [foo3](http://foo3)']

Concisely create a dict from the pairs in rest.

d = dict(chunked(rest, 2))
d
{'S1': '\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n',
 'S2': '\n\n- [foo3](http://foo3)'}
links = d['S1']
links.strip()
'-[foo](http://foo)\n- [foo2](http://foo2): stuff'
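
The split-and-pair step can also be written with the standard library alone. This sketch swaps `chunked` for slicing and `zip`; the sample text is made up for illustration.

```python
import re

text = "First bit.\n\n## S1\n\n- [a](http://a)\n\n## S2\n\n- [b](http://b)"

# A capture group in the pattern makes re.split keep each heading in the result
start, *rest = re.split(r"^##\s*(.*?$)", text, flags=re.MULTILINE)

# rest alternates heading, body, heading, body, ...; pair even items with odd items
sections = dict(zip(rest[::2], rest[1::2]))
print(start.strip(), list(sections))
```

Because every H2 contributes exactly one heading and one body, `rest` always has an even length, so the pairing cannot go out of step.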

Parse links into a list of link dicts. Multiple newlines may separate the entries.

_parse_links(links)
[{'title': 'foo', 'url': 'http://foo', 'desc': None},
 {'title': 'foo2', 'url': 'http://foo2', 'desc': 'stuff'}]
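
One way such a link parser could work is sketched below. The regex and the name `parse_links` are assumptions for illustration, not the module's private `_parse_links`.

```python
import re

# "- [title](url)" with an optional ": description"; the space after "-" is optional
LINK_RE = re.compile(
    r"-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<desc>.*))?"
)

def parse_links(links):
    """Turn a markdown file list into a list of dicts (sketch)."""
    out = []
    for line in re.split(r"\n+", links.strip()):   # tolerate blank lines between items
        m = LINK_RE.match(line.strip())
        if m: out.append(m.groupdict())            # desc is None when no notes follow
    return out

parsed = parse_links("\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n")
print(parsed)
```

Lines that do not look like a link item are silently skipped in this sketch; a stricter parser might raise an error instead.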

Create a function that uses the steps above to parse an llms.txt file into start and a dict keyed by section name, like d, with the parsed list of links as values.

start, sects = _parse_llms(samp)
start
'# FastHTML\n\n> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.\n\nRemember:\n\n- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it\'s automatic)\n- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element.'
title = named_re('title', r'.+?$')
summ = named_re('summary', '.+?$')
summ_pat = opt_re(fr"^>\s*{summ}$")
info = named_re('info', '.*')
pat = fr'^#\s*{title}\n+{summ_pat}\n+{info}'
search(pat, start, (re.MULTILINE|re.DOTALL))
{'title': 'FastHTML',
 'summary': 'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.',
 'info': 'Remember:\n\n- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it\'s automatic)\n- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element.'}
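
For reference, the composed pattern expands to a single regex that can be used with the standard library directly. The sketch below spells it out and also shows that a header without a blockquote still matches; `parse_header` is a hypothetical name used only here.

```python
import re

HEADER_RE = re.compile(
    r"^#\s*(?P<title>.+?$)"             # H1 line: the title
    r"\n+(?:^>\s*(?P<summary>.+?$)$)?"  # optional blockquote line: the summary
    r"\n+(?P<info>.*)",                 # everything that remains: free-form info
    re.MULTILINE | re.DOTALL,
)

def parse_header(start):
    """Return title, summary and info as a dict, or None if nothing matches."""
    m = HEADER_RE.search(start)
    return m.groupdict() if m else None

with_summary = parse_header("# Demo\n\n> Short summary.\n\nMore info.\n- item")
without_summary = parse_header("# Demo\n\nJust info.")
print(with_summary, without_summary)
```

`re.DOTALL` lets `info` run across line breaks, while the lazy `.+?$` keeps the title and summary to a single line each.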

Let’s finish it off!



parse_llms_file

 parse_llms_file (txt)

Parse llms.txt file contents in txt to an AttrDict

llmsd = parse_llms_file(samp)
llmsd.summary
'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'
llmsd.sections.Examples
(#1) [{'title': 'Todo list application', 'url': 'https://raw.githubusercontent.com/AnswerDotAI/fasthtml/main/examples/adv_app.py', 'desc': 'Detailed walk-thru of a complete CRUD app in FastHTML showing idiomatic use of FastHTML and HTMX patterns.'}]
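
The attribute-style access used above (`llmsd.summary`, `llmsd.sections.Examples`) comes from the AttrDict return type. A simplified stand-in, written here as an assumption about the behaviour rather than a copy of the real class, could look like this:

```python
class AttrDict(dict):
    """dict whose keys can also be read as attributes (simplified stand-in)."""
    def __getattr__(self, k):
        try: return self[k]
        except KeyError: raise AttributeError(k) from None

def to_attrdict(o):
    # Convert nested dicts recursively; lists are converted element by element
    if isinstance(o, dict): return AttrDict({k: to_attrdict(v) for k, v in o.items()})
    if isinstance(o, list): return [to_attrdict(v) for v in o]
    return o

parsed = to_attrdict({
    "title": "Demo",
    "summary": "Short.",
    "sections": {"Examples": [{"title": "App", "url": "http://app", "desc": None}]},
})
print(parsed.summary, parsed.sections.Examples[0].url)
```

Raising `AttributeError` for unknown keys keeps `hasattr` and `getattr` with a default working as expected.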

XML conversion

For some LLMs such as Claude, XML format is preferred, so we’ll provide a function to create that format.



get_doc_content

 get_doc_content (url)

Fetch content from local file if in nbdev repo.



mk_ctx

 mk_ctx (d, optional=True, n_workers=None)

Create a Project with a Section for each H2 part in d, optionally skipping the ‘optional’ section.

ctx = mk_ctx(llmsd)
print(to_xml(ctx, do_escape=False)[:260]+'...')
<project title="FastHTML" summary='FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore&#39;s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'>Remember:

- Use `serve()` for running uvic...
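
The overall shape of the XML can be approximated with the standard library. This is only a sketch: the element layout (a lowercase tag per section containing `doc` children) is an assumption, `to_project_xml` is a hypothetical name, and unlike the `do_escape=False` call above, `xml.etree.ElementTree` always escapes text.

```python
import xml.etree.ElementTree as ET

def to_project_xml(parsed, docs):
    """Build a <project> element with one child per section (illustrative layout)."""
    # title and summary must be strings here; None would fail at serialization
    proj = ET.Element("project", title=parsed["title"], summary=parsed["summary"])
    proj.text = parsed["info"]
    for name, links in parsed["sections"].items():
        sect = ET.SubElement(proj, name.lower())   # assumes single-word section names
        for link in links:
            doc = ET.SubElement(sect, "doc", title=link["title"])
            if link["desc"]: doc.set("desc", link["desc"])
            doc.text = docs[link["url"]]           # document body, fetched elsewhere
    return ET.tostring(proj, encoding="unicode")

parsed = {"title": "Demo", "summary": "Short.", "info": "Notes.",
          "sections": {"Docs": [{"title": "Guide", "url": "http://guide", "desc": "Start here"}]}}
docs = {"http://guide": "Guide body"}
xml = to_project_xml(parsed, docs)
print(xml)
```

Passing the fetched documents in as a dict keeps the sketch free of network access; in practice the bodies would come from downloading each URL.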


get_sizes

 get_sizes (ctx)

Get the size of each section of the LLM context

get_sizes(ctx)
{'docs': {'internal docs - ed': 34464,
  'FastHTML quick start': 27383,
  'HTMX reference': 26812,
  'Starlette quick guide': 7936},
 'examples': {'Todo list application': 18558},
 'optional': {'Starlette full documentation': 48331}}
Path('../fasthtml.md').write_text(to_xml(ctx, do_escape=False))
164662
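
Size reports like the one above help decide whether a context fits a model's budget, for example by leaving out the optional section. The sketch below assumes the context has already been reduced to a plain nested dict of section name to {document title: text} and measures size in characters; the real `ctx` is a richer object, and both function names are hypothetical.

```python
def section_sizes(ctx):
    """Map section -> {doc title -> character count} for a nested dict of texts."""
    return {sect: {title: len(text) for title, text in docs.items()}
            for sect, docs in ctx.items()}

def total_size(sizes, skip=()):
    # Sum all document sizes, ignoring any section named in `skip`
    return sum(n for sect, docs in sizes.items() if sect not in skip
               for n in docs.values())

demo_ctx = {
    "docs": {"Quick start": "a" * 120, "Reference": "b" * 80},
    "optional": {"Full documentation": "c" * 500},
}
sizes = section_sizes(demo_ctx)
print(sizes)
print(total_size(sizes), total_size(sizes, skip={"optional"}))
```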


create_ctx

 create_ctx (txt, optional=False, n_workers=None)

A Project with a Section for each H2 part in txt, optionally skipping the ‘optional’ section.



llms_txt2ctx

 llms_txt2ctx (fname:str, optional:<function bool_arg>=False,
               n_workers:int=None, save_nbdev_fname:str=None)

Print a Project with a Section for each H2 part in file read from fname, optionally skipping the ‘optional’ section.

Parameter         Type      Default  Details
fname             str                File name to read
optional          bool_arg  False    Include ‘optional’ section?
n_workers         int       None     Number of threads to use for parallel downloading
save_nbdev_fname  str       None     Save output to nbdev {docs_path} instead of emitting to stdout
!llms_txt2ctx llms-sample.txt > ../fasthtml.md
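
The `n_workers` parameter controls how many threads download documents in parallel. A generic way to do that with the standard library is sketched below; it is not necessarily how the module implements it, and `fetch_all` and the injected `fetch` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, n_workers=None):
    """Apply `fetch` to every URL using a thread pool; results keep input order."""
    # max_workers=None lets the executor choose a default pool size
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        return list(ex.map(fetch, urls))

# A stand-in for a real download function, so the sketch runs offline
def fake_fetch(url):
    return f"content of {url}"

results = fetch_all(["http://a", "http://b", "http://c"], fake_fetch, n_workers=2)
print(results)
```

Threads suit this workload because downloads spend most of their time waiting on the network rather than using the CPU.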